Executive Summary




The rapid adoption of information technology 
and ubiquitous networking has transformed 
the research and education landscape. Central 
to this transformation are scientific and engi- 
neering digital data collections. The life cycle 
management challenges associated with these 
intellectual assets are substantial. 

This is a report of a two-day workshop that 
examined the role of research and academic li- 
braries with other partners in the stewardship 
of scientific and engineering digital data. Work- 
shop participants explored issues concerning 
the need for new partnerships and collabora- 
tions among domain scientists, librarians, and 
data scientists to better manage digital data col- 
lections; necessary infrastructure development 
to support digital data; and the need for sustain- 
able economic models to support long-term 
stewardship of scientific and engineering digi- 
tal data for the nation's cyberinfrastructure. 

The workshop builds on prior studies sup- 
ported by the National Science Foundation 
(NSF), engaging numerous research communi- 
ties. It reflects the recognition, voiced in many 
NSF workshop reports, that digital data stew- 
ardship is fundamental to the future of scientif- 
ic and engineering research and the education 
enterprise, and hence to innovation and com- 
petitiveness. Overall, it is clear that an ecology 
of institutional arrangements among individuals
and organizations, sharing an infrastructure,
will be required to address the particularities
of heterogeneous digital data and diverse
scholarly and professional cultures.

The background of the workshop is de- 
scribed in Chapter I. Descriptions of the discus- 
sions of the three major topics from the three 
breakout groups and in plenary sessions are 
provided in Chapters II, III, and IV, and Chap- 
ter V discusses additional topics raised in the 
plenary sessions and final recommendations. 
Summary findings and final recommendations 
are presented below. 

Findings 

• The ecology of digital data reflects a dis- 
tributed array of stakeholders, institutional 
arrangements, and repositories, with a vari- 
ety of policies and practices. 

• The scale of the challenge regarding the 
stewardship of digital data requires that re- 
sponsibilities be distributed across multiple 
entities and partnerships that engage insti- 
tutions, disciplines, and interdisciplinary 
domains. 

• Historically, universities have played a 
leadership role in the advancement of 
knowledge and shouldered substantial 
responsibility for the long-term preserva- 
tion of knowledge through their university 
libraries. An expanded role for some re- 
search and academic libraries and univer- 
sities, along with other partners, in digital 
data stewardship is a topic for critical debate
and affirmation.

• Responsibility for the stewardship of digital 
information should be vested in distributed 
collections and repositories that recognize 
the heterogeneity of the data while ensuring
the potential for federation and interoperability.

• Stakeholder groups have different expertise, 
outlooks, assumptions, and motivations 
about the use of data. Forging partnerships 
will require transcending and reconciling 
cultural differences. Collaboration models 
to share expertise and resources will be crit- 
ical. 

• Stewardship of digital resources involves 
both preservation and curation. Preserva- 
tion entails standards-based, active man- 
agement practices that guide data through- 
out the research life cycle, as well as ensure 
the long-term usability of these digital re- 
sources. Curation involves ways of organiz- 
ing, displaying, and repurposing preserved 
data. 

• Infrastructure for digital data resources is a 
shared common good and the digital data 
produced through federally funded re- 
search is a public good. 

• The stewardship and sharing of digital data 
produced by members of the research and 
education communities requires sustain- 
able models of technical and economic sup- 
port. 

• There is a need for a close linking between 
digital data archives, scholarly publica- 
tions, and associated communication. The 
potential for an expanded role for research
libraries in the area of digital data stewardship
affords opportunities to address these
important linkages. 

• A change in both the culture of federal 
funding agencies and of the research en- 
terprise regarding digital data stewardship 
is necessary if the programs and initiatives 
that support the long-term preservation, 
curation, and stewardship of digital data 
are to be successful. 

• It is critically important that NSF and other 
funding agencies raise awareness and meet 
the needs of the research community for the 
stewardship and sharing of digital data. 

Recommendations from the Workshop 

Overarching Recommendation 

NSF should facilitate the establishment of a 
sustainable framework for the long-term stew- 
ardship of data. This framework should involve 
multiple stakeholders by: 

• Supporting the research and development required
to understand, model, and prototype the
technical and organizational capacities needed
for data stewardship, including strategies
for long-term sustainability, and at multiple
scales;

• Supporting training and educational pro- 
grams to develop a new workforce in data sci- 
ence both within NSF and in cooperation with 
other agencies; and 

• Developing, supporting, and promoting educational
efforts to effect change in the research
enterprise regarding the importance of the
stewardship of digital data produced by all scientific
and engineering disciplines/domains.



12 • To Stand the Test of Time 







Three general recommendations emerged 
around the following themes. 

NSF should: 

1. Fund projects that address issues concern- 
ing ingest, archiving, and reuse of data by 
multiple communities. Promote collaboration 
and "intersections" between a variety of stakeholders,
including research and academic libraries,
scholarly societies, commercial partners, science,
engineering, and research domains, evolving
information technologies, and institutions.

2. Foster the training and development of a 
new workforce in data science. This could 
include support for new initiatives to train in- 
formation scientists, library professionals, sci- 
entists, and engineers to work knowledgeably 
on data stewardship projects. 

3. Support the development of usable and use- 
ful tools, including 

• automated services which facilitate understand- 
ing and manipulating data; 

• data registration; 

• reference tools to accommodate ongoing docu- 
mentation of commonly used terms and con- 
cepts; 

• automated metadata creation; and 

• rights management and other access control 
considerations. 

These general recommendations and themes 
are amplified by the following targeted recom- 
mendations. 

1. NSF should develop a program to fund projects/case
studies for digital data stewardship
and preservation in science and engineering.
Funded awards should involve collaborations
between research and academic libraries, scientific/research
domains, extant technology bases,
and other partners. Multiple projects should
be funded to experiment with different models.

2. NSF, with other partners such as the Institute 
of Museum and Library Services and schools 
of library and information science, should sup- 
port training initiatives to ensure that informa- 
tion and library professionals and scientists can 
work more credibly and knowledgeably on data 
stewardship (data curation, management, and
preservation) as members of research teams.

3. NSF should support the development of usable 
and useful tools and automated services (e.g., 
metadata creation, capture, and validation) 
which make it easier to understand and manipu- 
late digital data. Incentives should be developed 
which encourage community use. 

4. Economic and social science experts should be 
involved in developing economic models for sus- 
tainable digital data stewardship. Research in 
these areas should ultimately generate models 
which could be tested in practice in a diversity 
of scientific/research domains over a reasonable 
period of time in multiple projects. 

5. NSF should require the inclusion of data man- 
agement plans in the proposal submission pro- 
cess and place greater emphasis on the suitabil- 
ity of such plans in the proposal's review. A data 
management plan should identify if the data are 
of broader interest; if there are constraints on 
potential distribution, and if so, the nature of 
the constraint; and, if relevant, the mechanisms 
for distribution, life cycle support, and preser- 
vation. Reporting on data management should 
be included in interim and final reports on NSF 
awards. Appropriate training vehicles and tools
should be provided to ensure that the research 
community can develop and implement data 
management plans effectively. 

6. NSF should encourage the development of data 
sharing policies for programs involving community
data. Discussion of mechanisms for developing
such plans could be included as part of
a proposal's data management plan. In addition, 
NSF should strive to ensure that all data shar- 
ing policies be available and accessible to the 
public. 



Introduction: Five Years and Five Centuries



This is a report of a workshop that examined 
the new partnerships, infrastructure and sus- 
tainable economic models required to support 
long-term curation and management of scientif- 
ic and engineering data as a critical component 
of the nation's cyberinfrastructure. It builds 
upon five years of careful thought and analysis 
by many research communities and reflects a 
concerted effort by scientists and librarians to 
evolve ways of collaborating to achieve the crit- 
ical goal of digital data stewardship. This over- 
arching goal is fundamental to the future of the 
scientific and engineering research enterprise 
and hence to innovation and competitiveness. 

We are living through a revolution in the 
conduct of science and engineering, enabled by 
advances in computing and information tech- 
nologies. It goes without saying that evidence 
based on both theory and data is fundamental 
to research. Reconciling the tension between 
theory and observation forms one of the ma- 
jor themes in the Scientific Revolution in the 
West, which may be said to have begun with 
the publication of Copernicus' De Revolutioni- 
bus in 1543. 1 

In our own time, many have pointed to 
computational and information technologies as 
having led to an advance in the scientific meth- 
od through techniques such as simulation and 
visualization. Large-scale investigations such as 
those in genomic sequencing and protein fold- 
ing and astronomical sky surveys have created 
data sets of a magnitude and granularity well
beyond what might have been accommodated
by paper and analog photography. Likewise, 
large databases have supported disparate and 
highly heterogeneous data for studies of his- 
tory and culture, climate, geography, ecology 
and weather. In addition, expanding digitally 
based communication systems permit remote
analysis and collaboration by distributed teams 
of investigators and new forms of dissemina- 
tion within and across disciplines as well as to 
the public (Figure 1-1). But with the reliance on 
digital data for scientific and engineering re- 
search, and the likelihood that such collections 
will proliferate, comes the need to manage the 
data, both to support the verification of published
findings and to enable re-use of
collections. Hence the need for sustainable eco- 
nomic and organizational models to support 
stewardship of digital data. 

The National Science Foundation (NSF) has 
played, and continues to play, a major role in 
this intellectual revolution, from supporting the
Internet and supercomputing centers to spon- 
soring basic research. Over the past five years, 
the agency has convened numerous work- 
shops 2 across the scientific disciplines to exam- 
ine the notion of a "cyberinfrastructure," that 
is, the integrated "hardware for computing, 
data and networks, digitally-enabled sensors, 
observatories and experimental facilities, and 
an interoperable suite of software and middle- 
ware services and tools." 3 The Foundation rec- 
ognizes that the implications of providing this 
infrastructure extend beyond physical facilities
and software tools: 

Investments in interdisciplinary teams and 
cyberinfrastructure professionals with ex- 
pertise in algorithm development, system 
operations, and applications development 
are also essential to exploit the full power of 
cyberinfrastructure to create, disseminate, 
and preserve scientific data, information 
and knowledge. 4 

These collaborations are perceived broadly to 
encompass stakeholders across government, 
the private sector, higher education and the 
research enterprise in the United States and 



internationally A frequent theme in many of 
the workshop reports, as well as in the pre- 
viously cited vision statement from the NSF 
Cyberinfrastructure Council, is the notion of 
data and its preservation. For example, as early 
as May 2001, NSF and the Office of Naval Re- 
search sponsored a workshop on marine geolo- 
gy and geophysics in La Jolla to examine issues 
related to data management. 5 The outcomes 
and recommendations are prescient in that 
they call for coordination of distributed centers 
(rather than centralization) and collaboration 
among the investigators and those who manage 
the collections. Thus, the challenges of manag- 
ing highly heterogeneous digital data arise not 
only from the different disciplines and underlying
heterogeneity of the data (from sensors
to censuses) but also from the different creator
and user communities.

Working with Data: Data Driving New Discoveries in Research and Education
Source: F. Berman, Preserving Digital Collections for Research and Education, September 26, 2006

Prior workshop participants have grappled 
with these issues at some length, examining 
how cyberinfrastructure allows them to ad- 
vance their research objectives as well as how 
best to manage the infrastructure itself. 6 On the 
basis of the past five years of study, the NSF 
Cyberinfrastructure Council has set a two-fold, 
five-year goal for data, data analysis, and visu- 
alization: 

• To catalyze the development of a system
of science and engineering data collections
that is open, extensible, and evolvable; and

• To support development of a new generation
of tools and services facilitating data
mining, integration, analysis, and visualization
essential for turning data into new
knowledge and understanding. 7

"The resulting national digital data framework,"
the Council's report continues, "will
consist of a range of data collections and man- 
aging organizations, networked together in a 
flexible technical architecture using standard 
open protocols and interfaces, and designed to 
contribute to the emerging global information 
commons." As envisioned in the report, the na- 
tional data framework will: 

• Promote interoperability between data collections
supported and managed by a range
of organizations and organization types;

• Provide for appropriate protections and
reliable long-term preservation of digital
data;

• Deliver computational performance, data
reliability and movement through shared
tools, technologies and services; and

• Accommodate individual community preferences. 8

This requires an overarching, coherent organizational
framework engaging both collections
and managing organizations, a flexible techni- 
cal architecture, and coherent data policies. 

The consequences for the research enter- 
prise are profound. As Hey and Hey have re- 
cently observed in their paper on the "e-Science 
revolution" and its implications for stakehold- 
ers, including libraries: 

Increasingly academics will need to col- 
laborate in multidisciplinary teams distrib- 
uted across several sites in order to address 
the next generation of scientific problems. 
In addition, new high-throughput devices, 
high-resolution surveys and sensor net- 
works will result in an increase in scientific 
data collected by several orders of magni- 
tude. To analyze, federate and mine this 
data will require collaboration between sci- 
entists and computer scientists; to organize, 
curate and preserve this data will require 
collaboration between scientists and librar- 
ians. 9 

And, they continue, "A vital part of the devel- 
oping research infrastructure will be digital 
repositories containing both publications and 
data." Thus, the transformation extends from 
the investigation, encompassing data collection 
and analysis, through communication of those 
results and the organizational settings in which 
these functions will be performed. 

Data in Organizations 

Curation and preservation (Box 1-1) in the analog
world were largely handled by libraries,
archives, and museums, some specialized and 
others more general. The system was robust in 
the sense that there was substantial overlap and 
complexity in the underlying business models 
and sources of support. However, the costs
were frequently absorbed by organizations and
entities that did not create or actually use the
information.

Box 1-1. Stewardship, Preservation, and Curation
Charles Humphrey, University of Alberta

Stewardship of digital resources involves both preservation and curation. What is the
distinction between these terms? Are these concepts describing the same thing? If not, how
do curation and preservation differ? Both concepts have been borrowed from other fields
and applied to the realm of research data. Curation has its roots in museum management,
while preservation traces its origins to archivists. The Digital Curation Centre (DCC) in the
U.K. defines digital curation as "maintaining and adding value to a trusted body of digital
information." 10 DCC documents frequently make reference to "curation and preservation."
That is, they treat these concepts as functionally different.

What are the functions that one would attribute to curation that differ from
preservation? Preservation consists of (a) the management practices based on standards
that guide and build metadata and data throughout the research life cycle and of (b) the
subsequent long-term care for these digital products. The outputs of (a) are copies of the
metadata and data in discipline-acknowledged standards best suited for (b), their long-term
care, access, migration and refreshment.

Curation involves ways of organizing, displaying, and repurposing preserved data
collections. Along the lines of the DCC definition, curation functions add value to a
collection of preserved data by organizing and displaying the data through analyses of the
collection's metadata or through the creation of new data from the preserved collection.

From the perspective of the life cycle of research data, 11 preservation occurs through
the stages of data production and the creation of research outputs represented on the
bottom row. Long-term preservation in this model consists of the practices followed
in caring for the data, which is represented by the box on the right side of the figure.

Data curation is characterized on the top row by the stages of data discovery and data
repurposing, which make use of the preserved data. The activities of these two functions
bring new value to the collection through analyses of the metadata, which display
aspects of the collection in new light, and the creation of new data from the existing data
collection.

Preservation of digital data, on the other 
hand, with its requirements for active man- 
agement has forced a re-examination within 
the library/archives community of inherited
assumptions about responsibility, use, over- 
sight, and cost. Thus, long-term preservation 
and curation are understood not merely as 
preservation of bits and the ability to decode 
them but also as a system that requires both 
cooperation across a diversity of organizations, 
uses, and stakeholders and sustainable models 
of technical and economic support. In a nut- 
shell, preservation is an organizational as well 
as a technical challenge and the responsibility, 
as has been widely recognized, spans a broad 
range of stakeholders. 12 

According to the definition of digital data 
adopted by the NSF Cyberinfrastructure Coun- 
cil, data are partitioned into three major catego- 
ries: research collections, which are generally 
individualized, project-based, and perhaps not 
even candidates for long-term preservation; re- 
source collections, which are community-based 
and mid- to long-term in anticipated longevity; 
and reference collections, which serve large 
segments of the research community, conform 
to agreed-upon standards, and require sub- 
stantial and very long-term support. 13 

Chapter III of the draft report acknowl- 
edges the need for organizational and techni- 
cal frameworks that reflect the diversity of data 
and of managing entities that can evolve as the 
data and technology evolve. These entities must 
maintain sufficient consistency, coherence, and 
interoperability to enable wide present and fu- 
ture use. More specifically, Berman has mapped 
a tri-partite data framework to the pyramid first 
proposed by Lewis Branscomb, enabling her to 
devise a data pyramid that links facilities, com- 
munities, and collections (Figure 1-2). 



The boundaries, particularly between the 
middle and top levels, are understandably fluid.
But, she points out, the framework allows fund- 
ing sources, particularly public sources, to be 
targeted appropriately. Thus, Berman argues, 
commercial interests might serve the needs of 
small collections and "future NSF researchers
and educators may request budgets for project- 
oriented storage services in the same way they 
currently request budgets for project-oriented 
personal computers." The middle level often 
involves partial federal investment, but invites
"creative private/public/academic partnerships"
like katrinasafe.com, a joint effort involving
the International Red Cross, Microsoft,
SDSC, and others. Finally, the top level, where 
national collections required to advance re- 
search and education are categorized, requires 
federal and sustained investment, again, in 
partnership with universities and other orga- 
nizations. Even in this case, however, Berman 
outlines a case for interagency support and 
multiple funding strategies. 14 

Berman's model is an example of one way 
to parse the substantial challenge of massive, 
highly heterogeneous collections, diverse com- 
munities of creators and users, rapidly evolving 
technologies, and disparate organizations. Data 
are collected across a wide range of instrumen- 
tation and methodologies as well as across dif- 
ferent disciplinary cultures. Consistently, those 
engaged in preservation of digital information 
have called for distributed yet federated collec- 
tions that recognize heterogeneity yet preserve 
high-level coherence and support interoper- 
ability. Managing that tension is a fundamen- 
tal requirement for the technical infrastructure 
and mandates partnerships across a range of 
stakeholder groups. 15 

To this effort requiring fusion of individu- 
als, organizations, technology, and collections, 
libraries bring to bear not only their experience 
in managing physical collections but also their 
long experience building partnerships and 
meeting the needs of diverse user communities.
Indeed, the academic and research library
community constitutes an information and so- 
cial infrastructure that can be leveraged to sup- 
port the needs of managing appropriate parts 
of this information infrastructure. Thus, while 
the workshop participants repeatedly called for 
active participation by scientists in the manage- 
ment of the data, creative partnerships between 
librarians and scientists will be critical for pro- 
fessional data stewardship in future science and 
engineering efforts. This workshop, therefore, 
assembled a range of investigators and librar- 
ians, and this report summarizes their delibera- 
tions and their recommendations. 

Diversity in the Data Landscape 

The data generated and used to support science
and engineering research exist in many
forms. In order for digital data to be analyzed,
managed, curated, shared, or preserved, it is 
necessary to have a contextual understanding 
of the data concerning its capture, acquisition, 
or generation. Workshop participants acknowl- 
edged the importance of understanding the en- 
vironment in which data are gathered and used 
in order to address how support for the digital 
data might be sustained over the long term. 

Each discipline or sub-discipline has its 
own set of data characteristics and the type 
of research project will determine the level of 
complexity of the data. Some of the character- 
istics include the nature of the data (e.g., raw 
numbers such as time, position, temperature, 
calibration; images; audio or video streams; 
models; simulations or visualizations; software; 
algorithms; equations; animations), whether 
they can be reproduced, and whether they have 
been processed or interpreted by some means.

The size of the project also determines the 
heterogeneity of the data. Small projects can be 
managed with raw data in simple flat files and 
coding familiar only to the project investigator. 
Large projects must agree on data standards 
and curation procedures to make any prog- 
ress. 

However, the convergence of communica- 
tion and computing technologies is providing 
new methodologies through which researchers 
gather and share larger amounts of instrument- 
ed and captured data. David Messerschmitt 
characterized what he called digital science as 
having "five complementary elements: 

• Collection of data from the physical world
(using distributed sensors and instruments);

• Distributed, organized repositories of such
data;

• Computation using theoretical models and
experimental data;

• Presentation of results for scientific visualization
and interpretation; and

• Support for collaboration among scientists."

He goes on to write, "[d]igital science and engineering
research often involves close coordination
of theory, experiment, and collaboration
among digital scientists, so geographically dis- 
tributed collaboration and access to geographi- 
cally distributed sensor networks and instru- 
mentation are crucial. Much of digital science 
is conducted by authoring (or in many cases 
executing existing) discipline-specific and ge- 
neric software that automates data collection 
and capture, computational models and data 
analysis, visualization of the results, and collaboration.
Software is a primary tool of a digital
scientist, just as microscopes and telescopes
and pencil and paper are tools of experimental 
and theoretical scientists." 16 

As projects become larger and more complex,
the volume of data collected grows by several orders
of magnitude, increasing the need for coordination
and data management. Databases and
specially designed repositories are established
to collect and make available raw data, in some
cases terabytes or petabytes of it, gathered
from such methods as sensor networks, satellite
surveys, or supercomputer simulators.

According to Long-Lived Digital Data Collections:
Enabling Research and Education in the 21st
Century, "[d]ata can also be distinguished by
their origins — whether they are observational, 
computational, or experimental. Observation- 
al data, such as direct observations of ocean 
temperature on a specific date, the attitude of 
voters before an election, or photographs of a 
supernova are historical records that cannot be 
recollected." 17 

Computational data are usually the result
of executing a computer model or simulation,
and their outputs can be reproduced
(provided the model and its associated descriptive
information about the hardware, software,
and input data are preserved). Experimental
data include such things as measurements
of patterns of gene expression, chemical reaction
rates, or engine performance. To be
shared, experimental data must be preserved
unless it is determined to be cost-effective to
reproduce the experiment.

The experimental process is the origin of 
another distinction, in this case between 
the intermediate data gathered during pre- 
liminary investigations and final data. Re- 
searchers may often conduct variations of 
an experiment or collect data under a va- 
riety of circumstances and report only the 
results they think are the most interesting. 
Selected final data are routinely included in
data collections, but quite often the inter- 
mediate data are either not archived or are 
inaccessible to other researchers. 

Processing and curatorial activities gen- 
erate derivative data. Initially, data may 
be gathered in raw form, for instance as a 
digital signal generated by an instrument 
or sensor. [These] raw data are frequently 
subject to subsequent stages of refinement 
and analysis, depending on the research 
objectives. There may be a succession of 
versions. While the raw data may be the 
most complete form, derivative data may 
be more readily usable by others. 18

All of the work to analyze and process the 
data is domain or application specific through- 
out the life of the project, but at the project's 
completion those data and their context, includ- 
ing provenance, need to be clearly articulated if 
the data are to be retained or shared. 

Domains and applications often differ significantly;
consequently, the increase in multidisciplinary
projects and the associated use of
pre-existing data highlight the importance of 
explicit descriptive information about data to 
optimize the investment made in data acqui- 
sition or generation. In a digital environment, 
data exist as bits or bytes, but without context 
they cannot be used. The data need descriptive 
information, tools, and frameworks in order to 
be fully useful for repurposing. 

Although there are many similarities in the 
landscape of digital data, the specifics of man- 
aging data will depend significantly on the na- 
ture of the primary discipline with which the 
data are associated. An understanding of how 
the data are produced and structured for use 
informs the methods available to curate and
preserve the data for the long term. 



Heterogeneous Data and Systems, Interoperability, 
and Metadata 

As previously observed, scientific and engi- 
neering data sets are highly heterogeneous. 
This heterogeneity arises from multiple sourc- 
es: formats, size, collection technique, coding, 
use of instrumentation. In addition, the deliv- 
ery and storage formats and media are also 
variable: capturing data by satellite telemetry
is fundamentally different from recording social science
data from telephone surveys. Astronomical
imagery looks little like a genomic database
and both differ substantially from census and 
polling data. Other collections are quite small 
but no less valuable. Moreover, the structure 
of the data sets reflects differences in the dis- 
ciplinary cultures in which they were created, 
including expectations about future use. Thus, 
the small scale project level collection of record- 
ings of indigenous language might never have 
been intended for inclusion in a larger anthro- 
pological reference collection yet the rarity of 
the small collection as well as the endangered 
language itself might change the circumstances 
of the long term use. As several participants 
pointed out, the confidentiality that surrounds 
respondents in a social science survey is not an 
issue in astronomical imagery but both require 
systems that ensure the long-term integrity of 
the data. Moreover, the career goals of indi- 
vidual investigators, who legitimately need to 
protect their use of the data for some period 
of time, must be reconciled with its potential 
long-term value to the community and to fu- 
ture researchers. 

This example, the tension between the 
prestige systems that surround individual in- 
vestigators and the long-term importance of 
a collection, is very familiar to archivists, in 
particular. It is not uncommon that a small
collection of important material must
be integrated into a larger collection. Yet, the 
conditions under which the creator (or author 
or investigator) worked and the expectations 
that future users (perhaps users a century
off) may bring to bear are substantially different.
Some of these tensions can be resolved
through embargoes on use of the collections or 
uses of some parts of the collections. However, 
digital data are fragile. Thus, the management 
of the data requires active engagement by the 
investigators as well as the information manag- 
ers. How this might work is open to discussion, 
but it is widely recognized that collaborations 
require recognition of the institutional prac- 
tices within which collections may have been 
created as well as the requirements imposed 
by management itself. Thus, Green and Gutmann
have argued for a life cycle approach to data
that understands reciprocal relationships and
information flows between investigators and
managers, enabling both sides to see where the
hand-offs occur and how the cultural conditions
of the respective environments can be
respected and accommodated.20

At the technical level, designers of reposito- 
ry architectures, the umbrella term used to em- 
brace the conceptual structure of a data storage 
system and the components required to orga- 
nize, manage, and provide access to them, have 
focused on ways to make heterogeneous data 
and systems interoperate. The goal is not to 
compel homogeneity but rather to devise layers 
on top of the data in their native or "raw" form 
(perhaps as transmitted by the instrumentation 
or collected by the investigator). This "layer" 
is the metadata, or data about data. The most 
familiar example is the bibliographic record 
in a card catalog, but computational metadata 
is much more extensive and can include in- 
formation about formats, structure, computer 
language, operating system, experimental en- 
vironment, and so on. Thus, interoperability 
among data, collections, systems, and institu- 
tions is inevitably linked to metadata. 
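
The idea of metadata as a layer over data left in their native form can be illustrated with a small sketch. It is not drawn from any system discussed at the workshop; the field names (title, data_format, structure, software_environment, instrument) and the helper functions describe and to_sidecar are assumptions chosen only to mirror the kinds of information listed above. The raw bytes are never modified; the record simply travels alongside them.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """Descriptive layer kept alongside a raw data file (hypothetical fields)."""
    title: str
    data_format: str           # e.g. "CSV" or "FITS"
    structure: str             # short description of the layout
    software_environment: str  # language or operating system used
    instrument: str            # instrument or collection method
    sha256: str = ""           # checksum of the raw bytes, set by describe()


def describe(raw: bytes, **fields: str) -> DatasetMetadata:
    """Build a metadata record for raw bytes without altering the bytes."""
    record = DatasetMetadata(**fields)
    record.sha256 = hashlib.sha256(raw).hexdigest()
    return record


def to_sidecar(record: DatasetMetadata) -> str:
    """Serialize the record as JSON so it can be stored next to the data."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A repository that receives both the file and such a record could, under these assumptions, index the record for discovery while leaving the data themselves in whatever form the instrument or investigator produced.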

Metadata standards are a critical part of 
digital information. Given what some have 
called a "deluge" of digital information, sub- 



stantial effort must be invested in automatic 
metadata creation, particularly when it is a by- 
product of the data collection workflow itself 
(as in astronomy, for example). Others have fo- 
cused on the role of investigators who might be 
incentivised to undertake initial metadata cre- 
ation according to an agreed upon community 
standard; both metadata and incentives for in- 
vestigators to participate in long-term curation 
systems are among the themes that permeate 
the position papers submitted by participants 
in this workshop. Addressing this topic is criti- 
cal since, as librarians and archivists have long 
argued, metadata in the digital environment 
must begin at creation precisely to address 
questions that might be most easily and accu- 
rately addressed at that point (for example, the 
nature of the instrumentation or the time of day 
for some types of data). In this regard, many 
involved in the creation of digital libraries and 
archives have pointed to the need for automatic 
metadata generation, which is, in many classes 
of scientific data (for example, sensor data or 
digital imagery), inherent in the process of data 
collection.19

Workshop Goals and Description 

Based on the framework established by prior 
work for the Office of Cyberinfrastructure, the 
workshop outlined three major goals: 

o To examine issues associated with sustain- 
able economic models for long-term preser- 
vation and curation of digital data and to ar- 
ticulate recommendations for further work, 
including identifying sources of funding; 

o To examine the structure of new partner- 
ships to facilitate seamless capture, process- 
ing, storage, and management of heteroge- 
neous scientific and engineering data; and 

o To examine the infrastructure requirements 
necessary to support long-term management
of digital data in distributed yet federated
collections, recognizing the rapid
pace of technological change and the need 
for unfettered access. 

Threaded through these goals was a series of
questions:

o What role do the research and academic 
libraries envision for themselves and do 
scientists envision for librarians in a digital 
data framework that provides for preser- 
vation of digital publications, digital data, 
and the links between them? 

o What partnerships /coalitions among re- 
search and academic libraries and with 
other sectors (government, international, 
non-profit, and for-profit) could facilitate 
the creation of such a framework? 

o How do new and emerging technologies 
affect organizational roles and responsibili- 
ties? 

o Are there opportunities to test sustainable 
models for digital data preservation orga- 
nizations, including consortia and partner- 
ships? 

o What resources would be required to en- 
able such tests? 

These goals and questions prompted a workshop
structure based on three breakout groups
with distinct charges:

o Infrastructure. How do we manage digital 
data now and migrate it successfully over 
future generations of technologies, stan- 
dards, formats, and institutions? 

o Partnerships. What mix of individuals and 
organizations should be involved in digital 



data preservation? What creative partner- 
ships can be developed between the mul- 
tiple sectors? 

o Sustainable Economic Models. What mod- 
els are required to sustain digital data man- 
agement and preservation efforts over the 
long term? 

Thirty-two individuals participated in the in- 
vitation-only, two-day workshop, held at NSF 
headquarters in Ballston, Virginia. A complete 
list of participants and the agenda are included 
among the appendices to this report. In ad- 
vance of the meeting, workshop participants 
were asked to submit a brief statement describ- 
ing their top three issues within the themes of 
the workshop. These statements were posted 
to the workshop Web site and analyzed in ad- 
vance of the meeting as a point of departure for 
the deliberations. They are included in their en- 
tirety in Appendix E. 

Participants were divided into three break- 
out groups, which articulated the issues and 
formulated recommendations. Workshop co- 
chairs were encouraged to ask their groups to 
outline well-formed, actionable recommenda- 
tions and, where possible, to identify links with 
other topics and funding opportunities and 
implications. The first day was given over to 
plenary sessions and breakout group delibera- 
tions. On the second day, the breakout groups 
reconvened in plenary session to present their 
findings and entertain broad discussion. 

Roadmap to this Report 

This report is divided into three major chapters,
each corresponding to one of the breakout groups.
Each chapter is divided into three major sec- 
tions: a brief summary of the issues and ra- 
tionale for the topic, the discussions that sur- 
rounded the issues in the group and during the 
plenary session that followed on the second 
day, and then the recommendations in priority 
order formulated by the breakout group. The
final list of recommendations is presented in 
Chapter V, Summary, Conclusions, and Recom- 
mendations, together with a summary of the 
more general discussion. 
II. Infrastructure




The breakout group on infrastructure took its 
charge to be as follows: What capacities should 
we build now to manage and migrate data over 
future generations of technologies, standards, 
formats, and organizational stakeholders? 

This chapter summarizes the discussion of 
the issues undertaken by the breakout group 
and sets forth its recommendations. 

Discussion of the Issues 

As Berman pointed out in her introduction to 
the workshop, data management and preserva- 
tion have multiple layers that require vertical 
integration between layers and horizontal co- 
ordination across collections, disciplines, and 



institutions, all to support active use of the
information (Figure II-1).

These functions require specific attention 
to the technical issues associated with interop- 
erability across heterogeneous networks, ser- 
vices, platforms, and data; metadata to support 
access, interoperability, and long-term man- 
agement and curation (including rights man- 
agement, security, privacy, and confidential- 
ity); and institutional policies to support these 
functions as well as collaborations among in- 
stitutions, disciplines, and investigator teams. 
Moreover, data preservation has distinctive re- 
quirements for resources, continuity, metrics of 
success, and funding: 
o Resources: Archival storage, network,
systems, replication and backup, staff for
maintenance, database management, and 
user services. 

o Continuity: A key criterion. Data collections 
must migrate over new generations of tech- 
nology seamlessly, without loss of informa- 
tion or interruption in service. 

o Metrics of success: No serious loss of data, 
preservation of reference collections, ap- 
propriate research collections. 

o Funding model: Funding commitments
need to provide the long-term consistency
that data collections require.

Within the breakout group, participants exam- 
ined their assumptions, requirements, and val- 
ues about the infrastructure — i.e., that shared 
layer that supports multiple and divergent col- 
lections and users within and across organiza- 
tions. They concluded: 

o Infrastructure is a shared common good; 
the data are a public good. 

o Infrastructure must: 

■ Support multiple representations of 
data by diverse users, now and in the 
future, representing diverse disciplin- 
ary cultures; 

■ Support new service layers and reuse of 
components through data models and 
metadata as well as institutional poli- 
cies governing use of collections and 
services; 

■ Support appropriate security systems, 
privacy, confidentiality; and 



■ Enable trusted relationships at multiple 
levels. 

Heterogeneity of data and collections and the 
constraints potentially associated with using 
those collections (for example, rights manage- 
ment) were considered a particular challenge as 
was the need to incentivise investigators both 
to contribute material and to undertake some 
of the critical processing activity (e.g., prepara- 
tion of metadata). Indeed, the question of how 
to construct appropriate incentives and the 
possible use of mandates struck a chord in the 
plenary session and will be discussed in more 
detail in Chapter V in the context of consensus 
recommendations. 

The OAIS Architecture 

The breakout group found the Open Archi- 
val Information System (OAIS) model 1 to be 
a useful mechanism for focusing and organiz- 
ing the discussion. The reference model as de- 
veloped over a number of years by the Space 
Science community and others has proven to 
be a useful mechanism for preservation within 
a number of communities. As pointed out by 
one participant, a key concept is the notion of
a "designated community," which explicitly
acknowledges the institutional culture within
which data are created and used. This meets
a critical need: namely cross-cultural communication
among professions (library, archival,
and research), institutions (libraries, archives,
universities, professional societies, and so on) 
and disciplines (physical sciences, life sciences, 
computational sciences, and social sciences). 
Specifically, the notion of a designated commu- 
nity recognizes the specifics of a given disci- 
pline and its data requirements yet enables the 
community to conceptualize a common archi- 
val framework. 

Two elements in the reference architecture 
proved especially useful: the environment 
within which the archive (or curatorial facility)
is located 2 (Figure II-2) and the major functions
of the archive itself 3 (Figure II-3), which
may be embodied in both centralized and dis- 
tributed organizational and computational ar- 
chitectures. The focus on function thus enables 
discussion of capacity, independent of the spe- 
cific institutional arrangements (the topic of the 
Partnerships breakout group) and recognizes 
that multiple instantiations may be possible. 

Four functions were especially relevant to 
exploring data curation functions: 



o Ingest: Which includes receiving informa- 
tion in a specified form, known as the "Sub- 
mission Information Packages;" quality as- 
surance; generating the form of the infor- 
mation that is actually stored in the archive, 
which is called the "Archival Information 
Packages;" extracting descriptive informa- 
tion; and coordinating updates. 

o Archival storage: Which includes services to
support storage, maintenance, and retrieval
of the archived information. Functions include
receiving the information that has
been prepared for archiving, refreshing the 
media as needed, routine monitoring for 
errors, providing for disaster recovery, and 
fulfilling requests for access to the data. 

o Data management: Which includes the services
and functions for populating, maintaining,
and accessing the descriptive information
important for identifying and
documenting the holdings and information 
necessary to manage the archive. Functions 
include database administration, perform- 
ing updates, and querying the management 
data to generate reports required to manage 
the archive. 

o Access: Which includes services and func- 
tions required to support users, enabling 
them to find information, including discov- 
ery, description, location, and availability. 

Two other functions, preservation planning 
and administration, are included in the model 
but did not generate substantial discussion in 
the breakout group. They are included here for 
completeness. 

o Preservation planning: Which includes ser- 
vices and functions required to monitor 
the environment (for example, format ob- 
solescence, standards, and so on), enabling 
managers to ensure that the information 
remains accessible. Planning functions in- 
clude developing migration plans and re- 
lated prototyping and implementation. 

o Administration: Which includes services 
and functions enabling overall operation of 
the archive. 

Finally, all of these functions are supported by 
a set of common services. These include: oper- 
ating systems, network services, and security 



(encompassing identification, authentication, 
access control, data confidentiality, data integ- 
rity, and non-repudiation). 
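
The ingest and archival storage functions described above can be sketched in a few lines. This is an illustration only, not part of the OAIS reference model: the class names SubmissionPackage and ArchivalPackage, the REQUIRED_FIELDS list, and the use of SHA-256 for routine error monitoring are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical minimum that a submission must carry; a real archive
# would define its own requirements.
REQUIRED_FIELDS = ("title", "creator", "format")


@dataclass(frozen=True)
class SubmissionPackage:
    """What the investigator hands to the archive (the submitted form)."""
    content: bytes
    metadata: dict


@dataclass(frozen=True)
class ArchivalPackage:
    """What the archive actually stores (the archival form)."""
    content: bytes
    descriptive_info: dict
    sha256: str


def ingest(sip: SubmissionPackage) -> ArchivalPackage:
    """Quality-check a submission and generate the stored form."""
    missing = [name for name in REQUIRED_FIELDS if not sip.metadata.get(name)]
    if missing:
        raise ValueError(f"submission is missing metadata: {missing}")
    digest = hashlib.sha256(sip.content).hexdigest()
    return ArchivalPackage(sip.content, dict(sip.metadata), digest)


def fixity_ok(aip: ArchivalPackage) -> bool:
    """Routine error monitoring: recompute the checksum and compare."""
    return hashlib.sha256(aip.content).hexdigest() == aip.sha256
```

In this sketch the quality assurance step is reduced to a check for required fields, and the stored checksum stands in for the routine monitoring for errors that the archival storage function performs.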

Discussion of the OAIS Framework 
Within this framework, the participants in the 
breakout group agreed that ingest and access 
were the critical functions to address. The is- 
sues include not simply what metadata are 
needed, but incentives to ensure the creation of 
necessary metadata. Storage, while important 
and challenging, did not present the same am- 
biguities and tensions associated with access or 
ingest. One participant observed, "Storage and 
computers are so cheap, I can buy several of 
them and then I can archive terabytes of data." 
Other disciplines like astronomy have far great- 
er storage requirements and the scale of stored 
data presents challenges to access. The group 
agreed that issues related to storage (in addi- 
tion to those identified in the OAIS reference 
implementation) include proliferation (storage 
is cheap, anyone can do it), scalability, hetero- 
geneity in media and formats, and architecture 
(centralized versus distributed). 

Metadata. Perhaps not surprisingly, metadata 
are critical not only to enabling access by con- 
sumers but also to the effective management 
of the data. Metadata schema are increasingly 
complex. For example, the schema for data 
preservation set forth by the National Library 
of New Zealand 4 has over 70 data elements, di- 
vided into four major categories or "entities" 
(Figure II-4). 

Incentives. The ideal of capturing metadata at 
the time of creation requires both techniques 
of automatic metadata creation as well as the 
active engagement of the investigators them- 
selves. The latter, in turn, requires both moti- 
vating the investigator (e.g., through increas- 
ing awareness of the imperative) as well as the 
development of tools to support the investigator's
active participation. In addition, the OAIS
model acknowledges that the object to be man- 
aged (the "package") might undergo transfor- 
mations, so that the package of information 
submitted by the investigator might look dif- 
ferent than the package that is managed inside 
the archive. Thus, the model does not require 
that the metadata that investigators might sub- 
mit meet the standard imposed by the archive. 
Rather, the investigators need only meet mini- 
mal standards. 

Repositories. Breakout group participants 
agreed that it is insufficient to outline incen- 
tives for investigators without ensuring that 
repository capacity is available to receive their 
data. Such repositories, largely for e-prints and 
publications, already exist in a number of disciplines
(see Box II-1), although their existence
tends not to be well known outside of their re- 
spective communities. 



Such resources vary by discipline. In some 
research areas, one participant pointed out, ac- 
cess to shared repositories is central to the na- 
ture of the research. An obvious example is the 
Protein Data Bank (http://www.wwpdb.org/),
which is an international collaboration and is 
sponsored in the USA by nine federal agen- 
cies. At the other extreme is the individualized 
approach characteristic of much social science 
research where data formats together with 
changes in the underlying phenomena (for ex- 
ample, geopolitical boundaries) may substan- 
tially complicate re-use of earlier data as well as 
justify a fresh round of data collection. Still, the 
ability to compare data over time, a hallmark 
of some questions in social research, would be 
greatly enhanced by continued access to and 
reuse of older data, not to mention potential 
efficiencies since it is generally known that 
data collection and cleaning can be extremely 
expensive. However, the requirements of managing
each of these types of data may be substantially
different. Incentives may be useful in
some cases, while insufficient in others.

ENTITY 1 - OBJECT

1.1 Name of object
1.2 Reference number
1.3 Identifier - Object IID
1.4 Persistent identifier - PID
1.5 UNIX location
1.6 Date of creation of preservation master
1.7 Technical composition
1.8 Structural type
1.9 Hardware environment
1.10 Software environment
1.11 Installation requirements
1.12 Access inhibitors
1.13 Access facilitators
1.14 Quirks
1.15 Authentication
1.16 Metadata record creator
1.17 Date of metadata record creation
1.18 Comments

ENTITY 2 - PROCESS

2.1 Object IID
2.2 Process
2.3 Purpose
2.4 Business unit or agency
2.5 Permission
2.6 Date of permission
2.7 Hardware used
2.8 Software used
2.9 Steps
2.10 Result
2.11 Guidelines
2.12 Completion date and time
2.13 Comments

ENTITY 3 - FILE

Common elements:
3.1 Object IID
3.2 File IID
3.3 Structural context
3.4 Filename and extension
3.5 File size
3.6 File date and time
3.7 MIME type/format
3.8 Version
3.9 Target indicator

3.10 Image:
3.10.1 Resolution
3.10.2 Dimensions
3.10.3 Tonal resolution
3.10.4 Colourspace
3.10.5 Colour management
3.10.6 Colour lookup table
3.10.7 Orientation
3.10.8 Compression

3.11 Audio:
3.11.1 Resolution
3.11.2 Duration
3.11.3 Bit rate
3.11.4 Compression
3.11.5 Encapsulation
3.11.6 Track number and type

3.12 Video:
3.12.1 Frame dimensions
3.12.2 Duration
3.12.3 Frame rate
3.12.4 Compression
3.12.5 Encoding structure
3.12.6 Sound

3.13 Text:
3.13.1 Compression
3.13.2 Character set
3.13.3 Associated DTD
3.13.4 Structural divisions

3.14 Datasets: Uses common elements only

3.15 System Files: Uses common elements only

ENTITY 4 - METADATA MODIFICATION

4.1 Object IID
4.2 Modifier
4.3 Date and time
4.4 Field modified
4.5 Data modified

Figure II-4.

The National Library of New Zealand Preservation Metadata Model
Source: National Library of New Zealand. Metadata Standards Framework - Preservation
Metadata (Revised), June 2003, Appendix 1

The Importance of Prototypes: The National Virtual 
Observatory 

Several of the position papers pointed to the 
importance of prototypes as a way of under- 
standing aspects of curation ranging from in- 
teractions with the community, to business 
models, to technical architectures and services. 
One important example is the National Vir- 
tual Observatory Project (NVO), which is sup- 
ported by the NSF and led by Johns Hopkins 
University, California Institute of Technology, 
and the Space Telescope Science Institute and 
engages numerous partners in academic in- 
stitutions, publishing, and research centers in 
the US and internationally. Until recently, the 
project has not undertaken data preservation 



efforts, but a new grant from the US Institute
of Museum and Library Services (IMLS) will
enable it to develop prototype projects in data 
curation: capturing, curating, preserving, and 
providing access to the wealth of data that the 
NVO has accumulated and makes available to 
investigators. Goals for the new work include 
assessing scientific impact as well as working 
out sustainable business models (see Chapter V 
for more discussion about issues surrounding 
economic sustainability). 5

The NVO is a coalition of archives and re- 
search projects based at different facilities and 
includes diverse instrumentation. The core of 
the NVO's structure is interoperability enabled 
by common metadata standards that allow in- 
dividual archives and collections to be made 
visible to the end user through a common por- 
tal or application. The underlying heterogene- 
ity is thus masked to the user but preserved as 



Box II-1. Data archives and repositories (not comprehensive, for illustration only)

Cambridge Structural Database [crystal structures]
http://www.ccdc.cam.ac.uk/products/csd/

CISTI Depository of Unpublished Data
http://cisti-icist.nrc-cnrc.gc.ca/cms/unpub_e.html

CrossFire Beilstein [chemistry]
http://www.mdl.com/products/knowledge/crossfire_beilstein/

Digital Archive Network for Anthropology and World Heritage
http://dana-wh.net/home/

Earth Observing System Data Gateway
http://redhook.gsfc.nasa.gov/%7Eimswww/pub/imswelcome/

Global Biodiversity Information Facility [prototype data portal]
http://www.europe.gbif.net:80/portal/index.jsp

Global Change Master Directory
http://gcmd.nasa.gov/

Goddard Earth Sciences Data and Information Services Center
http://daac.gsfc.nasa.gov/

Inter-university Consortium for Political and Social Research (ICPSR)
http://www.icpsr.umich.edu/

IRI/LDEO Climate Data Library
http://ingrid.ldgo.columbia.edu/

IRIS (Incorporated Research Institutions for Seismology)
http://www.iris.edu/data/data.htm

Land Processes Distributed Active Archive Center (LP DAAC)
http://edcdaac.usgs.gov:80/main.asp

Marine Geoscience Data System
http://www.marine-geo.org/

Multimission Archive at STScI [Astronomy]
http://archive.stsci.edu/

NASA's High Energy Astrophysics Science Archive Research Center
http://heasarc.gsfc.nasa.gov/

NASA Space Science Data Archives
http://science.hq.nasa.gov/research/space_science_data.html

National Center for Atmospheric Research & the UCAR Office of Programs
http://www.ucar.edu/tools/data.jsp

National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/

National Center for Ecological Analysis and Synthesis (NCEAS) Data Repository
http://knb.ecoinformatics.org/knb/style/skins/nceas/index.html

National Oceanographic Data Center
http://www.nodc.noaa.gov/

National Virtual Observatory Project
http://www.us-vo.org/

Oak Ridge National Laboratories DAAC
http://www-eosdis.ornl.gov/holdings.html

Statistical Reference Datasets
http://www.itl.nist.gov/div898/strd/

Statistics Canada
http://www.statcan.ca/start.html

TranStats [Transportation]
http://www.transtats.bts.gov/DataIndex.asp

U.S. Department of Agriculture Economic Research Service Data Sets
http://www.ers.usda.gov/data/

U.S. Geological Survey, National Satellite Land Remote Sensing Data Archive
http://edc.usgs.gov/archive/nslrsda/

Worldwide Protein Data Bank
http://www.wwpdb.org/
relevant to the individual project or archive.
Investigators publish their findings in a range of
publications, which have actively cooperated in 
the NVO. While the journals do release a subset 
of the data (typically in images and tables) the 
underlying observational data rarely appear 
in the journals, in part because investigators 
themselves are reluctant to release work that 
is still in progress. Yet, these results, however 
preliminary, are clearly of use to future inves- 
tigators, hence the need for a close coupling 
between the archive and the scholarly commu- 
nication process (a point also made in Chapter 
IV, New Partnerships). 

The architecture, which reflects both the 
roles of partners and the technological capaci- 
ties, is shown in Figure II-5. In technical terms, 
the architecture combines NVO Web services 
with a distributed repository system based on 
Fedora in a framework that supports long-term 
digital archiving of derived astronomical data
in a variety of formats (for example, tables, 
catalogs, spectra, images, and documents). The 
distributed nature of this preservation frame- 



work is significant because it acknowledges the 
importance of multiple parties being responsi- 
ble for different functions, depending on their 
relative expertise. The modular nature of the 
technical system bolsters the ability to support 
different components and elements over time 
without requiring expensive or difficult over- 
hauls of the entire system. 
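
The pattern described here, heterogeneous archives made visible through a common portal by means of shared metadata, can be sketched as follows. This is not the NVO's actual interface: the Portal class, the two adapter functions, and the native field names (img_id, target, cat_no, name) are hypothetical, chosen only to show how each archive can keep its native records while a thin adapter maps them onto common fields.

```python
from typing import Callable, Dict, List, Tuple

CommonRecord = Dict[str, str]
Adapter = Callable[[dict], CommonRecord]


def from_image_archive(native: dict) -> CommonRecord:
    """Map a hypothetical image archive's native record to common fields."""
    return {"id": native["img_id"], "title": native["target"], "kind": "image"}


def from_catalog_archive(native: dict) -> CommonRecord:
    """Map a hypothetical catalog archive's native record to common fields."""
    return {"id": native["cat_no"], "title": native["name"], "kind": "catalog"}


class Portal:
    """Single point of access over archives that keep their native formats."""

    def __init__(self) -> None:
        self._sources: List[Tuple[List[dict], Adapter]] = []

    def register(self, records: List[dict], adapter: Adapter) -> None:
        """Add an archive together with the adapter that translates it."""
        self._sources.append((records, adapter))

    def search(self, text: str) -> List[CommonRecord]:
        """Return common-format records whose title contains the text."""
        needle = text.lower()
        hits = []
        for records, adapter in self._sources:
            for native in records:
                common = adapter(native)
                if needle in common["title"].lower():
                    hits.append(common)
        return hits
```

Because each archive is reached only through its adapter, a new archive or a changed native format affects one adapter rather than the portal as a whole, which is the modularity the paragraph above attributes to the architecture.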

The architecture is clearly compatible with 
key elements of the OAIS architecture. It is 
based on a designated community, namely as- 
tronomers and related institutions (libraries, 
journals); it identifies an environment with- 
in which the specific curatorial and archival 
functions occur, namely the library (which is, 
in fact, a group of libraries that operate cooperatively);
it allows for coherent access by
several stakeholder groups (the publishers, the 
scientists who deposit data, and the research- 
ers who use data from the collections); and it 
provides for storage. The "storage" is actually 
a set of services and facilities that function over 
a set of repositories that accommodate disparate
and heterogeneous data that share common
metadata standards. The storage function
is replicated to improve performance and enhance
stability and security of the system.

Figure II-5.

National Virtual Observatory Architectural Components
Source: S. Choudhury et al., Digital Data Preservation in Astronomy, 2nd International Digital
Curation Conference: Digital Data Curation in Practice, Glasgow, UK, November 21-22, 2006.

Recommendations of the Breakout Group 

The group recognized the need for a change in 
the way that NSF and other research agencies 
structure their calls; hence there is a need for 
a change in the culture of the funding agen- 
cies relative to data stewardship, as well as 
for changes in the culture of the research en- 
terprise that these agencies support. One par- 
ticipant commented, "The large goal for NSF 
should be: find a way to make data description, 
integration, and archiving part of the scientific 
process. In proposals they should require, en- 
force, and fund a data management plan ac- 
cording to agreed upon standards." The group 
agreed on recommendations for future action 
and research that address the dominant chal- 
lenges of cross-disciplinary cultures, interoper- 
ability across heterogeneous data, models and 
systems, and incentives. The primary targets 
for these recommendations fall in four areas: 

Recommendation II.1: Build capacity and
models

Recommendation II.2: Develop policy infrastructure
to create a culture of sharing

Recommendation II.3: Promote education

Recommendation II.4: Promote research
initiatives

These are briefly discussed in the following 
paragraphs. 

Recommendation II.1: Build capacity and
models. This recommendation calls for sup- 
porting the development of collaboratories that 
model essential infrastructure for community- 
based, data-enabled research. Such collaborato- 



ries may be built upon existing resources or be 
newly conceived and could exist in a number 
of institutions and settings, incorporating 1) 
stand-alone institutions or organizations at lo- 
cal, regional, national, or international scales; 2) 
research and academic libraries, consortia, and 
special collections; and 3) data centers housed 
in universities, scientific and professional soci- 
eties, and other heritage institutions. Such col- 
laboratories could include projects that: 

o Address small, medium, and large data col- 
lections 

o Encourage collaborations among multiple 
stakeholders 

o Instantiate data models and technical and
organizational architectures

o Reflect discipline-based or cross-disciplin- 
ary data collection, submission, and reuse 

o Incorporate development of a sustainabil- 
ity plan 

Specific research topics include: prototyping 
technical architectures under different organi- 
zational and collaborative arrangements; speci- 
fying ingest systems at different scales; deploy- 
ing interoperable data models across organiza- 
tions; developing tools for automatic metadata 
creation and for methods to "harvest" informa- 
tion about collections that might not be part 
of existing centers but might be of interest to 
investigators or potential candidates for future 
inclusion. 

Recommendation II. 2: Develop policy
infrastructure to create a culture of sharing. This
recommendation calls for creating data man- 
agement policies to ensure that the contribu- 
tion of research data is considered a shared as- 
set, enabling reuse in new research contexts as 
shared public goods. This includes:

o Understanding how to motivate and incen- 
tivize researchers to contribute digital data 
to collaborative environments (deposit). 

o Understanding the range of rights manage- 
ment, data confidentiality, security, and pri- 
vacy concerns. 

o Developing support structures and training 
programs that will encourage investigators 
to prepare archive-ready data and objects 
with appropriate metadata and formats. 

Specific research topics include: surveys to un- 
derstand patterns of use, deposit, and reuse of 
archival data across disciplines; work with jour- 
nal editors, librarians, and others in the scholar- 
ly communication process to understand their 
requirements and how management of scien- 
tific data might be integrated into existing in- 
formation flows; specifying the potential range 
of rights management, security, confidentiality, 
and privacy concerns among institutions, col- 
lections, and individuals, and ways for manag- 
ing these concerns now and in the future. 

Recommendation II. 3: Promote education. 

This recommendation calls for NSF among oth- 
ers to stimulate the development of expert data 
curators and an informed scientific community. 
Specific suggestions involved: 

o Partnering with IMLS to support new programs
in data science/curation.

o Supporting cultural change to encourage 
scientists and engineers to contribute to 
digital data collaboratories. 

o Contributing to the development of curri- 
cula to create next generation scientists and 
information specialists. 



Specific areas of research and funding might 
include: surveys of existing curricula to deter- 
mine needs and support for model programs; 
and training programs, particularly those that 
might be aimed at graduate students and ju- 
nior faculty, consistent with other programs 
for young researchers that NSF currently spon- 
sors. 

Recommendation II. 4: Promote research. 

Each of the previously described recommenda- 
tions might support various research topics. In 
addition, the group identified a series of dis- 
crete problems related to infrastructure. These 
include creating and developing: 

o Interoperability across institutions, collec- 
tions, and heterogeneous data 

o Reference tools to support specific kinds of 
collections and continued use (for example, 
gazetteers to map and monitor changes in 
geopolitical boundaries and terminologies) 

o Architectural options to address risk man- 
agement 

o Specification of intellectual property and 
access rights 

o Collaboration tools 

o Security and trust 

o Cross-disciplinary discovery 

o Attributes of discipline-based repositories 

o Appropriate standards 

o Automatic generation of metadata 

Endnotes

1 Consultative Committee for Space Data Systems,
Recommendation for Space Data System Standards,
Reference Model for an Open Archival Information
System (OAIS), CCSDS 650.0-B-1 Blue Book (January 2002).
http://public.ccsds.org/publications/archive/650x0b1.pdf.

2 Ibid., p. 2-2, especially Figure 2-1. 

3 Ibid., p. 4-1, especially Figure 4-1. 



4 National Library of New Zealand. Metadata
Standards Framework - Preservation Metadata
(Revised), June 2003, p. 4;
http://www.natlib.govt.nz/files/4initiatives_metaschema_revised.pdf.
On the
mapping to the OCLC/RLG model, see Appendix 
6 of this document. 

5 S. Choudhury et al., Digital Data Preservation in
Astronomy, 2nd International Digital Curation Con- 
ference: Digital Data Curation in Practice, Glasgow, 
UK, November 21-22, 2006. 

III. New Partnership Models




The importance of partnerships is the domi- 
nant critical theme in the participants' position 
papers. The charge to the breakout group asked 
the questions: What mix of individuals and or- 
ganizations should be involved in data pres- 
ervation? What creative partnerships can be 
developed between the multiple sectors? That 
is, who should be involved? What are their re- 
spective contributions, roles, and responsibili- 
ties? 

Discussion of the Issues and Challenges

The breakout group outlined a series of chal- 
lenges and issues primarily as they affected 
and were affected by research and academic 
libraries. Three major information processes
(the scholarly communication process, the
knowledge transfer process, and the life cycle
of research) formed a matrix within which the
group considered the role of librarians, the re- 
quirements of preservation and curation facili- 
ties, and short- versus long-term needs. On the 
basis of this discussion, the group was able to 
characterize the relevant stakeholders and out- 
line a series of considerations from which new 
partnerships might be modeled and eventually 
established. 

Models, Processes, and Stakeholders. The three
information processes (the scholarly
communication process, the knowledge transfer
process, and the life cycle of research) are
shown in Figures III-1, III-2, and III-3. Figure
III-1 shows how the traditional role of research
and academic libraries is shifting from a focus
on managing information in its published form
to managing the digital data sets on which
published findings may be based. Figure III-2
illustrates the life cycle of research through
which data are generated, and Figure III-3 embeds
the information in a social context, showing how
information is used at different steps of the
knowledge transfer process.

(Slide: Chuck Humphrey (Alberta), ARL Workshop on
New Collaborative Relationships, 26-27 September
2006)

Figure III-2. The Life Cycle of Research
Source: Chuck Humphrey, The Role of Academic
Libraries in the Digital Data Universe,
September 26, 2006

Figure III-3. The Knowledge Transfer Cycle
Source: Chuck Humphrey, The Role of Academic
Libraries in the Digital Data Universe,
September 26, 2006

Some of these steps generate information 
that may be curated; in others, the information 
may be used and transformed. Each of these 
cycles has a series of participants and stake- 
holders on which partnerships may be based. 
Stakeholders include: 

o Universities 
o Libraries and librarians 
o Domain specialists 
o Computer scientists 
o Standards-setting bodies 
o Editors 

o Professional societies 
o Publishers 

o Commercial and not-for-profit vendors 
o Funding agencies 

But as noted in many of the position papers, 
these stakeholder groups have different
expertise, outlooks, assumptions, and motivations
about the use of data, so that forging partner- 
ships also requires transcending and reconcil- 
ing cultural differences. For example, in one 
position paper the author commented, "Cur- 
rently, the responsibility for management of 
data accompanying scientific publications falls 
to publishers in the form of supplemental data 
collections. Academic journals have little incen- 
tive to invest in the establishment and mainte- 
nance of digital data repositories that can be 
used for anything beyond minimal documenta- 
tion of published reports." 1 Yet another author 
observed, "The traditional practice of
gathering a paper trail of research outputs
long after a scientific investigation has been
concluded and
depositing them with an archive is inapplicable 
in the digital era. Too much valuable research 
data are either at high risk of being lost or have 
been destroyed because of inappropriate prac- 
tices that were carried over from a time when 
paper was the dominant medium. The chal- 
lenge is to coordinate among partners the care 
of research data throughout the life cycle." 2 

Motivation and Heterogeneity. One of the most
challenging issues is motivating individual
investigators to deposit their data.
Repeatedly, the position paper authors called attention
to the need to motivate researchers. In part, this 
is a question of raising awareness of the future 
potential value of the data. But curation of
individual investigators' data sets also requires
resolution of the tension between the promo- 
tion and tenure system in which researchers 
understandably wish to protect their research 
asset in the form of their data set and the pos- 
sible broader use of that data. 

Moreover, as a practical matter individual- 
ized data sets can be highly idiosyncratic and 
difficult to integrate into larger collections, 
leading to the questions surrounding metadata 
discussed in the previous chapter and to issues 
of appraisal and management. That is, what is 
determined appropriate for inclusion in a col- 
lection and how are those decisions reached? 
(This is reflected in Figure III-2 by the arrow 
between data access and dissemination and 
data repurposing.) As one participant noted in 
the plenary discussions, "In conjunction with a 
study we did in Canada, we looked at the size 
of the problem in preserving research from hu- 
manities/social sciences. Between 505 and 595 
out of every 1,000 funded projects resulted in 
the creation of data/databases. There's a lot of
data being created. How significant is this?" 

Distributed Systems and Hand-offs. The ob- 
vious solution is distributed systems in which 
different disciplines and entities undertake
responsibility for different "pieces" of the
landscape. But working out the specifics is
challenging and resources to support such
facilities
are required. If the universities undertake to 
support these activities, either by housing cu- 
ratorial facilities on their campuses or by par- 
ticipating in consortia, there is concern that the 
less well-resourced institutions could be dis- 
advantaged. Moreover, while it is agreed that 
there are multiple stakeholders and potential 
partners, difficulties may arise in determining 
respective responsibilities and where hand-offs 
occur. If the author of a paper is responsible for 
seeing to the disposition of the data, what is the 
role of the journal and the journal editor when 
an article is published based on part or all of the 
data? Should publishers take responsibility for 
maintaining complete runs of their journals and 
the data on which the articles are based? And 
how does this model compare with the morgue 
traditionally maintained by major newspapers, 
which comprises the "paper of record" and not 
the reporters' notes, polls, and opinion survey 
or other original data that contemporary news- 
papers frequently commission? 

This problem of hand-offs (or more gener- 
ally, of interfaces between complementary and 
potentially overlapping roles) is complicated 
by differing time scales. Current partnerships 
tend to focus on interoperability and integrated 
access but lack a long-term component, which 
is fundamental to partnerships whose mis- 
sion is long-term data stewardship and to en- 
suring sustainability and hence confidence or 
trust in the system. So it becomes necessary to 
align short-term pressing needs with long-term 
goals. The time scale issue is compounded by 
tensions over open access, rights management, 
and policies governing protection of confiden- 
tiality, limitations on liability of data sources, 
use of data for humanitarian purposes, and 
definitions of appropriate use. 3 



The Role of Research and Academic Librar- 
ies. Piecing together compatible roles in new 
partnerships requires parties to think through 
their respective responsibilities and how these 
change in the digital environment. The Na- 
tional Virtual Observatory's new prototyping 
offers one example of how the research library 
can function in this environment, as specified 
in the OAIS model (see discussion in Chapter 
II). Librarians face both changing roles and 
changing perceptions of their roles relative to 
information (as implied in Figures III-1 and
III-3). Traditionally, libraries have focused on 
information discovery, rather than information 
management, and, as one position paper au- 
thor argued, the reference function continues to 
be an important function in the digital age. In 
addition, libraries have tremendous credibility 
on campus as shown in a survey of information 
use at four-year liberal arts colleges and
Ph.D.-granting public and private universities that
was funded by the Andrew W. Mellon Founda- 
tion. Ninety-eight percent of the faculty, gradu- 
ate students, and undergraduates included 
in this study agreed with the statement, "My 
institution's library contains information from 
credible and known sources." 4 

That said, the group found that research and 
academic libraries need to expand their port- 
folios to include activities related to storage, 
preservation, and curation of digital scientific 
and engineering data. This requires evaluating 
where in the research process chain (Figure III- 
2) curation and preservation activities (Figure 
II-2) should take place, what capacities should 
be built to support these activities (see Infra- 
structure), and who is best suited to undertake 
these activities. This raises three related
questions: Where do partnerships come into play?
Where are the hand-offs? How do we lower the 
barriers to participation? Experience suggests 
it is difficult for research and academic
libraries to have the expertise to curate in
every domain area. It must be a shared
responsibility. Librarians can identify where
skills lie across domains and coordinate them so
faculty can do discovery. Universities have
played a leadership role in the advancement of
knowledge
and their libraries have shouldered substantial 
responsibility for the long-term preservation of 
knowledge. An expanded role in digital data 
stewardship for some of these universities and 
libraries, along with other partners, is a topic 
for critical debate and affirmation. 

"It takes a research community to preserve its 
data." 

The group agreed that given the scale of the 
challenge, the responsibilities should be distrib- 
uted across multiple entities and partnerships 
that engaged their respective communities. 
During the plenary discussion, one participant 
noted that some people managing discipline- 
based archives are worried about institutions 
or library programs taking on this responsi- 
bility. So, while the specific responsibilities 
might be distributed, it would also require a 
tight partnership with the library or institution 
and the discipline. "As a researcher I'd love to 
have help of someone with expertise" to assist 
with curation and management. Others in the 
plenary discussion suggested looking to or- 
ganizational models offered by museums and 
archives where the notion of the community 
served by the institution is not bounded by a 
single institution in the way that research and 
academic libraries are situated within universi- 
ties and colleges. 

In any case, whether a museum or research 
or academic library, it is important to build the 
necessary expertise and equally important to 
build an appreciation of the need for this ex- 
pertise. 

Outreach and Education. The group outlined
a series of steps that would be required to
address the challenges of building new
partnerships. The first is to raise awareness
and meet
the needs in the research community. Outreach 
to the scientific community to raise awareness 
of the need for long-term data stewardship 
and to encourage researchers to reuse data was 
a theme that appeared in several of the posi- 
tion papers as well as in this session. Wrote one 
participant, "Many scientists continue to use 
traditional approaches to data, i.e., developing 
custom data sets for their own use with little at- 
tention to long-term reuse, dissemination, and 
curation. Although there has been considerable 
progress in data stewardship for 'big science' 
projects, even modest collaborative projects are 
inconsistent in their attention to data manage- 
ment and few individual scientists think be- 
yond posting selected results and data on the 
Internet or submitting a final data product to a 
data archive if required to do so. Changing this 
sort of behavior will require a range of efforts, 
...perhaps most important of all, concerted ef- 
forts to educate current and future scientists to 
adopt better practices." 5 Librarians and "data
scientists" also require new training. "We must 
ensure that the talent to preserve scientific data 
will be available," wrote one participant. "The 
preferred approach is to provide incentives for 
computer science and library science depart- 
ments to include suitable disciplines in their 
curricula." 6

Research: Metrics, Motivations, and Technical 
Requirements. Workshop participants called 
for more formal metrics and studies similar to 
market research. For example, it was suggest- 
ed that when proposals are written to NSF, a 
data management plan could be written into 
the budget as part of the budget justification. 
When the final report is submitted, it might in- 
clude information on the execution of the plan. 
Then, the Foundation might gather the data, as 
it does for science, technology, and engineering
indicators, and use this information to
understand the patterns as well as to underline
the importance of the data management exercise.
In short, this idea has a double resonance:
it highlights data management as a metric and 
calls attention to its importance, precisely be- 
cause it is a metric. By extension, such metrics 
provide incentives to researchers both as
requirements and as a potential source of recognition.

Clearly, the viability of partnerships among 
institutions is based on people to staff them and 
the willingness of investigators to take
advantage of them. Systematic information about
this willingness is lacking for researchers
outside fields such as astronomy or genomics,
where massive data sets are central to the
research.
Consistently, the position paper authors cited 
reluctance among many researchers to engage 
in partnerships to curate digital data. But at 
least one study of this sort is in progress at
ICPSR. Statistical Disclosure Control: Best Practices
and Tools for the Social Sciences, led by Jo Anne 
McFarland O'Rourke and Myron Gutmann 
and funded by the National Institute of Child 
Health and Human Development (NICHD), 
constructed a national sample of National In- 
stitutes of Health (NIH) and NSF researchers 
who collected original social or behavioral data
between 1998 and 2001. 7 In addition to issues
specifically about disclosure, the survey covered
related topics, including data sharing practices, 
concerns about risks to data subjects' privacy 
in files that are shared, and procedures and re- 
sources for preparing files to be shared. It was 
fielded in April through July of 2006 and the results are
not yet analyzed. However, it is expected that 
the results will help establish a climate for data 
sharing by developing an instrument that oth- 
ers can use and documenting some of the barri- 
ers investigators face, the extent to which they 
do share data, the ways in which they share, 
and the training in which they would be
interested. One of the overall aims of the
project is to develop tools to assist
researchers in evaluating their data for risks
of disclosure and provide guidelines to help
them determine appropriate measures to protect
against those risks.
Both the instrument, which will eventually be 
made available for broader use, and the results 
are critical to understanding the environment 
within which viable institutional partnerships 
might be built. 

Once the research community has been educated
and the interest in and need for curatorial
facilities recognized, the next step is to
understand and define the requirements for
repositories (for example, granularity,
metadata, rights
management, and so on); many of these issues 
were addressed by the Infrastructure Group 
(see Chapter II). The management structure 
of the institutions in which these repositories 
might be located should distribute stewardship 
responsibilities across a body of responsible 
parties roughly equivalent in magnitude (i.e., 
size, capacity) to the magnitude of the collec- 
tive data in need of stewardship and reflective 
of the disciplines that create data in the data 
repository and that might use that data. The 
management structure should also ensure that 
the work environments of those responsible 
parties are well supplied with curatorial tools 
that facilitate their carrying out their responsi- 
bilities. Clearly, a range of structures might be 
possible (a point the Infrastructure group also 
made), and a period of prototyping and test- 
ing would be required before new capacities 
were deployed. Key elements of any institution 
would be measurement and an appropriate 
business model (see Chapter IV). 

Recommendations of the Breakout Group 

As the last paragraph suggests, the structure of
any partnership requires attention to the
infrastructure, both that of a given facility
and the broader infrastructure that would enable
cooperation across individual institutions.
Indeed, the library and archives professions
have been leaders in the development of
organizational relationships that enable
libraries and archives to function coherently
across a broad range of specific institutional
instantiations. That said, the New Partnerships 
group made one major recommendation and 
five narrower recommendations that built out 
aspects of the principal recommendation. 

Overarching Recommendation 

NSF should facilitate the establishment of a 
sustainable institutional framework for long- 
term data stewardship involving the players 
previously enumerated (e.g., libraries,
universities, professional societies, scholarly
journals/publishers, disciplines, etc.). This
was the
major, overarching recommendation put forth 
by this group. The framework must encourage 
the articulation of what constitutes "curation" 
in various disciplines, encourage a diversity of 
designs and approaches that are sympathetic to 
the needs, practices, and relationships within 
affected research communities, and encourage 
the development of distributed partnerships 
between research libraries and research insti- 
tutions. As part of this exercise, it should be 
expected that there is a balance between pro- 
totypes and long-term commitments as well as 
the capacity to evaluate the success of different 
approaches and to migrate data sets collected 
from experimental to long-term facilities as dif- 
ferent models evolve. 

Moreover, these approaches, it was pointed 
out in the plenary session, should consider the 
model of other heritage institutions (archives, 
museums, scholarly societies) where commu- 
nities of interest transcend university boundar- 
ies and merge across several disciplines. These 
groups have grappled with many of the issues 
that drive concerns associated with manage- 
ment of digital collections. For example, with 
respect to confidentiality and individual pri- 
vacy, archivists are used to taking in items with 
use restrictions. That concept allows archivists 
to understand how to manage data sets, for
example with researchers who want information
embargoed for a certain period of time.

Pursuant to this overarching recommenda- 
tion, the breakout group articulated five more 
recommendations that executed specific as- 
pects of this broad idea. Namely, 

Recommendation III. 1: NSF should fund
pilot projects/case studies that demonstrate
the intersections between libraries, a limited
number of scientific/research domains, and
extant technology bases.

Recommendation III. 2: NSF should fund 
projects in which research libraries develop 
deep archives of irreplaceable data, assur- 
ing descriptions of these data at a minimal 
level (floor, not ceiling), and facilitating dis- 
covery and access to these data, according 
to prevailing community standards. 

Recommendation III. 3: NSF should require 
that data management plans submitted as 
part of the application process identify the 
players involved in the custodial care of 
data for the whole of its life cycle and should 
support training initiatives to ensure that 
the research community can fulfill this re- 
quirement. This would include promoting 
new curricula, developing new programs, 
and linking the training of domain scientists
with library/information scientists.

Recommendation III. 4: NSF should fos- 
ter the training and development of a new 
workforce in data science. 

Recommendation III. 5: NSF should partner 
with IMLS to train information and library 
professionals (extant and future) to work 
more credibly and knowledgeably on data 
curation as members of research teams. 

The plenary session generally concurred with 
these recommendations with two caveats: the 
structure of incentives and the possible need
for programs to reduce the potential disadvan- 
tage to smaller, less well-resourced institutions. 
Both of these topics are discussed in more de- 
tail in Chapter V, where summary conclusions 
and recommendations are presented. 



Endnotes 

IV. Economic Sustainability




An economically viable system for data stew- 
ardship requires prevention rather than rescue 
of endangered materials. As one participant 
noted, 1 what is needed is a system of diagno- 
ses of similar problems, treatment protocols 
and good practices, and criteria for making 
decisions. The charge to the Economic Sustain- 
ability group was: What models are required to 
sustain data management and preservation ef- 
forts over the long term? This broad question 
entailed consideration of many sub-questions 
and assumptions, among them: what does it 
cost to maintain, preserve, and manage collec- 
tions including access to those collections? Who 
uses the collections? And who should support 
them? 



Discussion of the Issues 

This group covered a broad range of issues, 
including funding strategies, business models, 
the concept of public goods, and motivations.
It proved to be a complex subject that embraced
a wide range of concerns, with a particular
focus on sustainability issues.

Overview and Framework. Rusbridge, one of 
the plenary speakers and a co-chair of the break- 
out group, outlined the range of questions that 
converge under the rubric of economic sustain- 
ability (Figure IV-1). 

Figure IV-1. Economic Sustainability Questions
Source: Chris Rusbridge, September 26, 2006

These topics fall roughly into seven broad
categories:

1. What to sustain? The obvious answer is
the data sets. Yet the "what" question also 
subsumes other forms of information and 
knowledge, a point also raised in the New 
Partnership Models breakout group in the 
context of the Knowledge Transfer Cycle 
(See Figure III-3). In addition, the model 
makes the important point that the eco- 
nomic sustainability of data stewardship 
also sustains the research enterprise itself. 

2. How to get value? This question has sev- 
eral dimensions: At the level of data, it can 
mean the transformations required to make 
the data meaningful to researchers. It can 
also mean modeling the trade-offs required 
to determine priorities and decisions. Not 
all data need necessarily be afforded the 
same level of processing. Some data may be 
unique and therefore irreplaceable, but in 
other contexts, it may be more cost effective 
to replicate the experiment than to preserve 
the data. (The value can also accrue in other 
ways; the data set may be a source of aca- 
demic credit in itself.) 

3. What payment approaches work? The li- 
brary community has been heralded for 
its ability to maintain large scale organi- 
zational coherence while maintaining ro- 
bust heterogeneity in funding and business 
models at the institutional and local levels. 
Curating data requires a similar coherent 
yet diverse set of approaches that recognize 
cultural factors (some disciplines have been 
historically reluctant to pay for data access 
and use) as well as legacy arrangements 
(the university library is frequently funded 
centrally by the university and not by con- 
tributions from individual departments, 
and thus the cost of maintaining a relative- 
ly specialized data set for a given discipline 
would presumably come from the library's 
budget and not from the department or the
investigators who created and might use
the data). Other models discussed by the
group included: 

o Inter-University Consortium for Politi- 
cal and Social Research (ICPSR), which 
employs several strategies (subscrip- 
tion, user fees, federal, private funding) 
and is discussed in greater detail later 
in this chapter. 

o The Mormon Church, which combines 
tithing, user fees, and sales. 

o The Public Broadcasting Service, which 
assembles private donations, public 
funding at the federal and state levels 
for operations and for specific projects, 
volunteers, who contribute time and ex- 
pertise, and sales. 

o Volunteer activity, which might take 
various forms, including perhaps 
archiving@home similar to SETI@home, 
or use of distributed commodity re- 
sources as in LOCKSS. 

o Markets, with a number of examples 
suggested, including DRI, data "fu- 
tures," shares, and so on. 

o Hybrid funding, which comprises a 
mix of public and private funding from 
multiple sources. Public funding, it was 
noted, can take the form of contracts of 
several years that are re-competed so 
that the administration of a facility may 
be separated from the facility itself. Con- 
ceptually, the commitment to the data is 
separated from the commitment to the 
particular service organization. 

The University of California, for example, 
has operated the Los Alamos Laboratory 
for decades in a series of contracts and the
NSF has similar arrangements for operation 
of its telescopes. This enables the threshold 
costs of construction to be separated from 
the operating costs and enables the govern- 
ment to revisit its share of operating costs 
and to obtain more efficient pricing for 
these services. With respect to the internal 
operations, this also means that the cur- 
rent holder of the contract is incentivized to 
practice reasonable efficiencies as well as to 
maintain the operations in a way that en- 
ables smooth transitions. 

Several variations might be imagined. For 
example, a consortium that holds a multi- 
year contract to operate a regional data cen- 
ter might agree to hand off the actual opera- 
tion each year to a different member of the 
team, an arrangement that provides for re- 
dundancy in expertise and enables transfer 
of experiential knowledge. Moreover, if the 
structure of the consortium required each 
member of the consortium to provide a 
percentage of matching funds, then shared 
matching costs might be equally distrib- 
uted within the team by utilizing their re- 
spective capacities. This approach, how- 
ever, presents many logistical and practical 
challenges. 

4. How to persuade society to pay? This 
question is related to the preceding topic 
and resonates with the themes of outreach, 
communication, and education articulated 
by the New Partnerships breakout group. 
If the data sets are seen as an element of the 
infrastructure, then presumably it would 
be easier to obtain funding for their main- 
tenance, either through public funding as 
a public good, which is discussed subse- 
quently, or through user fees or other forms 
of payment by the research community. 
Like others, the Economic Sustainability
breakout group saw incentivizing scientists
as a critical element and also called for en- 
gaging volunteers. Certainly in some disci- 
plines, notably archaeology and astronomy, 
volunteers have provided critical resources 
and the fields have benefited from the as- 
sociated positive public profile, broad sup- 
port, and direct contributions of time and 
effort. 

5. What capacities are required? This question
goes to some of the issues discussed
in the Infrastructure group (see Chapter
II); these capacities are necessarily inputs
into both economic and business models.
It is important to note, though, that like the
Infrastructure breakout group, the Economic
Sustainability group understood the notion
of capacity as encompassing the human
resources required to manage a data
curation/stewardship center as well as the
physical and software facilities.

6. What capabilities are required? This ques- 
tion addresses the services that a curation/ 
stewardship facility might require to man- 
age the internal processes as well as the ex- 
ternal relationships with users, administra- 
tors (if the curation facility is housed within 
a larger institution), sponsors, and so on. 
These capabilities might be provided by 
humans but might also be automated and 
hence reliant on technology. 

7. Economic roles. This topic embraces a number
of issues relating to the economic roles of
the users, of the investigators who deposit
collections, of the NSF and other sources of
public funding, and of the curation/stewardship
facility itself. Ultimately, the facility
represents an intellectual asset with economic
value, although assessing this value
is very challenging. Indeed, the economic
value of intangible assets is an active area of
contemporary research and one that might
become useful in future studies. 2

As several of the group participants pointed 
out, there is an important distinction to be 
drawn between the economic model, which ad- 
dresses questions that arise from the economic 
system, and the business model, which looks at 
how an individual institution remains solvent. 
The former looks at societal, macroeconomic, and
microeconomic issues; the latter is highly
context-dependent. Thus, the economic models consider
the role of the government based on its aggre- 
gate behavior and funding trends; the business 
model for a research or academic library might 
develop a plan based on its endowment, fund 
raising, alumni support, parent budget, and 
needs of its faculty, students, and research pro- 
grams in which direct public funding from ei- 
ther state or local sources might be a relatively 
small component. Both perspectives are impor- 
tant, but they operate at different scales. 

Public Goods. From the economic perspec- 
tive, data and the curation facilities in which 
they are housed are typically viewed as a 
public good. A public good has been defined
as "a good that must be provided in the same
amount to all affected consumers." 3 In technical
terms, this means that the good is both
nonrival, meaning that one person's consumption
of the good does not diminish the amount
available to another consumer, and nonexcludable,
meaning that one cannot exclude another
from consuming it. 4 The classic examples are
sidewalks, clean air, clean water, and national 
defense. However, Varian (among others) has 
argued that there is an important distinction 
between the two properties. Whereas nonrival- 
rousness inheres in the good itself, the nonex- 
cludable property is socially constructed. Thus, 
providing clean water to all makes sense from a 
public health perspective but is, nonetheless, a 
decision. Information shares these public good
properties. It is nonrival, in the sense that the
cost of reproduction is very low (although the
cost of initial production can be quite high), and
it has been made excludable through intellectual
property regimes. 5

From the perspective of the data center, in- 
tellectual property rights management is less 
a source of income than a cost of operation. 
However, the notion of a public good high- 
lights some of the challenges of building a 
business model. What constitutes the commu- 
nity of consumers? Moreover, public goods are 
vulnerable to the free rider problem, namely, 
when individuals allow others to provide the 
public goods that they then consume. This is a 
problem that is inherent in archives, where the 
consumer of the data 5, 10, or 100 years hence 
by definition will not have contributed to its 
production. This has been characterized by one 
participant as "public goods with an unknown 
future value." 6 There exists a substantial body 
of economic literature on the problems of infor- 
mation goods, public goods, and the associated 
problems, and the group believed that it would 
be useful to undertake systematic investigation 
of this literature and its relevance to modeling 
the economic sustainability of long-term stew- 
ardship of data. 

Models and Examples. In her plenary remarks, 
Berman sketched out one model that linked the 
type of collection to an organizational structure 
and funding source (see Figure 1-2). Small, local
collections might be supported privately,
perhaps by commercial entities offering "warehousing"
space. Regional collections might be
supported by regional libraries and data centers,
and large national/international reference
collections would be supported by national
governments. In practice, though, data centers 
could employ a range of funding strategies, 
including relay funding, where a succession 
of grants is linked together; user fees, applied 
to both depositor and user; endowments; and
membership dues. At ICPSR, which is housed 
at the University of Michigan, approximately 
25 percent of the annual budget comes from 
memberships in the consortium, 10 percent 
from fees from the summer school program, 33 
to 40 percent from long-term projects to provide 
resources to the community, and the remainder 
from "project-based" funding, which is used 
sparingly to support infrastructure, as well as 
from federal agencies and private funders. 
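
The size of the "remainder" implied by these approximate shares can be worked out directly. The short Python sketch below is illustrative only: the function and variable names are ours, not ICPSR's, and the inputs are simply the rounded percentages quoted above, so the result is a rough range rather than an audited budget figure.

```python
# Approximate ICPSR revenue shares quoted in the text (percent of annual budget).
# These are the report's rounded figures, not audited numbers.
memberships = 25
summer_school = 10
long_term_projects = (33, 40)  # reported as a range


def remainder_range(fixed_shares, ranged_share):
    """Return the (low, high) percent left over after the named sources."""
    fixed_total = sum(fixed_shares)
    low_share, high_share = ranged_share
    # The largest ranged share leaves the smallest remainder, and vice versa.
    return (100 - fixed_total - high_share, 100 - fixed_total - low_share)


low, high = remainder_range([memberships, summer_school], long_term_projects)
print(f"Remainder: roughly {low} to {high} percent")
```

Under these assumptions, project-based, federal, and private funding together would account for roughly a quarter to a third of the annual budget.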

ICPSR is widely acknowledged as a success 
story, both in terms of its acceptance by the rel- 
evant communities and as a business venture. 
Some of the reasons for its success include: 

o It provides a robust environment with low 
barriers to access. ICPSR accepts data in 
many formats subject to a well-defined and 
well-articulated set of criteria. 

o It houses content which is of great value 
to its user communities and maintains that 
content in good form. Thus, ICPSR refresh- 
es the media and migrates formats as neces- 
sary and appropriate. 

o It has a business model and structure which 
reflect the culture of the domain and con- 
stituent users. 

o It provides useful tools associated with 
data. 

o It maintains a trusted repository. Users can 
be confident that the data they deposit will 
be maintained in good order, including ap- 
propriate access controls and hence appro- 
priate levels of confidentiality, and they can 
also be confident that data they obtain from 
the repository will be authentic. That is, the 
data have not been compromised or tampered
with as a result of storage or transmission
within the repository. (Whether the data are
accurate is a separate issue and not one with
which the data center need be concerned.)

Costs: Facilities and Human Resources. A
business model recognizes not only revenues
but also costs, that is, the cost of managing the 
facility: staffing, resources, and so on. Less easy 
to quantify, but potentially far more costly, is 
the cost of labor, particularly the skilled labor 
associated with management and preservation. 
These activities include collection policies (in- 
cluding appraisal, weeding, deaccessioning, 
and so on); data clean-up, normalization, de- 
scription, and preparation for submission (see 
Chapter II. Infrastructure for a discussion of the 
distinction between the submitted and archived 
form of the data in the OAIS framework); and 
collaboration with researchers around schol- 
arly communication, best practices, and related 
topics. 

Appraisal. Appraisal is a particularly challeng- 
ing topic and bears on the mission of the data 
center (and hence on the question of partner- 
ships) as well as on the underlying capacities 
(and hence on the question of infrastructure). 
As a practical matter, archivists and librarians 
are well aware that data is more abundant than 
time to process it. Indeed, the National Research 
Council's report on the Electronic Records Ar- 
chive at the National Archives and Records Ad- 
ministration (NARA) predicted an "avalanche 
of digital materials" resulting from advances 
in computer technology, more finely-grained 
information and transactions, dynamic genera- 
tion of information, communication technolo- 
gy that enables more frequent transactions and 
that generates preservable records (for example,
text messages, e-mail messages, and so on), 
and the drop in the cost of storage. 7 The report 
called for multiple approaches to preservation, 
recognizing that technologies will continue to 
evolve and that different records require different
treatment. From the perspective of evaluating
collections, workshop participants noted
that small projects were at greater risk than 
large, since large collections tend to call atten- 
tion to the data by virtue of their size and scale. 
The Human Genome Project, the Protein Data
Bank, and the National Virtual Observatory are all
examples of large, valuable, well-recognized, 
and well-curated data collections. Moreover, 
because of their scale and innate complexity 
they have multiple sponsors and an inherent 
redundancy. (The storage structure of the Na- 
tional Virtual Observatory was discussed in 
Chapter II.) 

Nevertheless, selection is necessary and se- 
lection policies require the skilled services of 
both information and domain professionals, 
who must address several major questions: 

What is the minimal level of processing re- 
quired to preserve the collection in a way 
that it can be subject to more detailed and 
sophisticated processing at some future 
time? 

How are decisions made about which col- 
lections should be processed fully (within 
the constraints of existing technologies) and 
which should be set aside with only mini- 
mal processing sufficient to ensure stability 
and prevent degradation? 

Can we develop an understanding of how
data might be re-purposed in ways totally
unpredicted by its original
creator/gatherer/researcher?

What legal/ethical constraints encourage
or discourage the long-term preservation of 
particular data sets? 

Value Proposition. Although these questions 
are phrased in terms familiar to librarians and 
archivists who are responsible for collection 
development, these are also business questions
that can be understood as a "value proposition."
Value can be a difficult concept. A collection
may possess value because it is rare and
unique. Value can also be assessed as the cost
to replace it, enabling trade-offs between the
cost of preservation and the cost of reconstituting
the data. Finally, value can be understood
as the value to the researchers, a measure
of how much a future researcher may want the 
data and the ability it may afford to ask differ- 
ent kinds of questions precisely because longi- 
tudinal data may exist. These different notions 
of value all have economic and business conse- 
quences that are not fully understood. As put 
by one of the participants in her position pa- 
per, "If we take as a given that not all data are 
created equal and that we will not be able to 
afford to keep everything, how do we decide 
where to invest in preserving data? How do we 
make effective economic decisions in the face 
of uncertainty about the supply of data and the 
future demand? Are there any economic mod- 
els or research issues that provide insights into 
comparable problems? What happens when 
the future value of a particular set of data is 
contingent upon its relationship to other data 
that have been preserved, and can therefore be 
aggregated? At what level of granularity do 
we make selection (e.g., investment) decisions, 
given that deciding what to preserve is a very 
labor-intensive and expensive process." 8

Demand. Running through these discussions, 
but not explicitly stated, is the problem of de- 
mand. There is a tendency to look at infrastruc- 
ture primarily from the producer side. Indeed, 
in the formal definition of a public good, de- 
mand is assumed to be roughly uniform; prob- 
lems and disparities arise when it is not, as evi- 
denced by the free rider problem. In practice, 
however, notions of value as well as decisions 
about collections, partnerships, the support- 
ing infrastructure, and hence the economic 
and business models all eventually face the
question of demand. Who will want to use the
collections? Who will be willing to pay even a 
nominal user fee? And how does demand af- 
fect value? To some degree, these questions are 
inherently unknowable. However, it is possible 
to parse knowable aspects through the kind of 
market research and programs of public educa- 
tion and awareness that the partnership group 
discussed. 

Recommendations of the Breakout Group 
The group formulated eight recommendations. 
Many resonate with recommendations made 
by the other groups. They are discussed in the 
following paragraphs. 

Recommendation IV. 1: Involve economics 
and social science experts in developing 
economic models for sustainable data pres- 
ervation; this research should ultimately 
generate models which could be tested in 
practice. 

Economists have begun to make progress in 
addressing various contributing elements of 
the data curation/data preservation problem: 
better understanding of information goods and
markets; the economics of public goods and
their relationship to market mechanisms; and
the valuation of intangibles, including information
assets. These developments, if systematically
examined from the curation/preservation
point of view, could be brought usefully to bear 
in formulating testable models. A second set 
of questions arises from the demand side, un- 
derstanding motivations of the various stake- 
holders. A third set of questions examines the 
spectrum of potential business models that 
take into account the scope of collections (lo- 
cal, regional, national, and international); the 
range of organizational arrangements, as set 
forth by the New Partnerships group; and the 
infrastructure requirements, as outlined by the 
Infrastructure group. 



Recommendation IV. 2: Set up multiple re- 
positories and treat them as experiments. 

It is evident that multiple strategies will be re- 
quired to meet the circumstances created by 
different disciplines, collections, and partner- 
ships. Repository experiments should be re- 
quired to develop plans to address key issues 
such as transition between media/formats/institutions,
self-sustainability, exit strategy, etc.

Recommendation IV. 3: Develop usable and 
useful tools for automated services and 
standards which make it easier to under- 
stand and manipulate data. 

Automated tools to optimize the labor inputs 
of the professionals who staff curation facili- 
ties, as well as to make the researchers more ef- 
fective, are critical. Such tools not only render 
the operations of the data stewardship facilities 
potentially more efficient but they also lower 
the barrier to participation by the researchers 
themselves and can potentially lead to better 
science. This has the positive effect of building 
momentum behind stewardship but also cre- 
ates demand for the data stewardship facilities' 
services as both a depository and a source of 
data. Both can potentially generate income for 
the facility by becoming sources of user fees. 
Many communities are currently unfamiliar 
with the notion of paying to use data, but oth- 
ers are not, as the example of ICPSR illustrates. 
Thus, an education program coupled with 
successful stewardship facilities that meet the 
pent-up demand can create a so-called "virtu- 
ous circle" to support preservation, curation, 
and stewardship in the concerned disciplines. 

Recommendation IV. 4: Require a data 
sharing plan in proposals that has practical 
value (and appropriate support). Plans for 
resource and reference data should contribute
to community data stewardship.

Augmenting the NSF proposal process prompted
vigorous discussion during the closing plenary
and will be discussed in greater detail in
Chapter V. The basic idea, as presented in this 
breakout group report, was to include a section 
analogous to sections in the existing budget 
justification in which the proposer explains cer- 
tain elements of the budget request. The ratio- 
nale for including the provision for data is that 
it calls attention to the data collected as part of 
the project, raises its visibility for both the pro- 
poser and the reviewer, and begins to attack the 
problem of what data should be preserved, by 
whom, and for how long. Thus, an investigator 
may very well argue that data collected as part
of the project does not merit long-term preservation,
but if it does, then the justification begins
to contribute incrementally to broad criteria for 
preservation within a discipline and also cre- 
ates a demand for the curatorial facility itself. 
Such requirements and facilities, for example, 
have long been a feature of archaeological in- 
vestigations. Archaeological projects, whether 
funded as a research project or as part of the 
national environmental review process, are re- 
quired to cull the collection recovered from the 
excavation and deposit it at a certified reposi- 
tory, typically maintained by the state accord- 
ing to agreed-upon federal guidelines. 

Recommendation IV. 5: Create and enforce 
data sharing policies among NSF awardees 
(e.g., the final report may not be accepted 
unless the awardee is compliant with the 
stated data management plan).

Like the preceding recommendation, this pro- 
posal resulted in active discussion during the 
plenary session about burdens potentially 
placed on investigators. The intent, though, is 
to raise the profile of digital data stewardship 
among domain scientists and funding agen- 
cies, create resources for future users, and meet 
the demand and need for these services. 



Recommendation IV. 6: Use the NSF pro- 
gram process to help the research and ac- 
ademic library community take more re- 
sponsibility for the stewardship of scientific 
and engineering research data (potentially 
with other funders). 

Encouraging investigators to deposit their data 
sets in curatorial facilities is meaningful if such 
facilities exist. Although maintaining such digi- 
tal depositories has not historically been part of 
the research and academic libraries' portfolio of 
responsibilities, some have argued, including 
those who participated in the New Partnerships 
group, that it is a reasonable extension of those 
libraries' mission. These libraries generally en- 
joy reputations as trusted facilities on campus 
and elsewhere, and funding and support from 
NSF and other agencies would assist them to 
undertake this responsibility by enhancing 
their capacities in specific instances and thus 
enhancing their credibility in this realm. 

Recommendation IV. 7: Use the NSF pro- 
gram process to facilitate cultural change in 
the research community. 

This recommendation expands on the more 
specific recommendations above, which tied 
data management requirements to the propos- 
al and reporting processes. More generally, it 
was intended to encourage NSF to think about 
a variety of strategies in which the importance 
of preservation/curation/stewardship might
be made more visible. For example, and as discussed
by the New Partnerships group, contributions
to data curation facilities might be
included in the various reports of science and 
technology indicators, and credit might be giv- 
en to creators of data sets through citation in 
NSF-funded reports, while institutions recog- 
nize this value through promotion and tenure 
reviews. 



Recommendation IV. 8: Undertake capacity
and capability building activities. 

This recommendation covers a range of spe- 
cialized activities, including software tools to 
support automatic or semi-automatic metadata 
creation, reference tools and ontologies, moni- 
toring, and so on. It also addresses the need for 
training for domain specialists, data scientists, 
librarians, and archivists, compilation of best 
practices, and consideration of appropriate 
data management policies and procedures.

V. Summary, Conclusions, and
Recommendations 



The stewardship of digital data is fundamen- 
tal to the research enterprise. The data's utility 
ranges from the ability to replicate an experi- 
ment, to the efficiencies associated with reusing 
data, to the potential to ask new kinds of ques- 
tions through new capabilities to integrate and 
manipulate data. So the challenge becomes, 
how do we as a society collectively gather, 
store, manage and make these data sets avail- 
able while respecting a multitude of legitimate 
and sometimes competing interests? As the 
system of higher education itself suggests, an 
ecology of organizations will be required. But 
the organizational ecology that currently char- 
acterizes higher education (including heritage 
and cultural institutions with which they are 
affiliated) evolved over the course of decades, 
built on centuries of tradition and practice. And 
the research enterprise is also dynamic as new 
institutions and disciplines enter the system 
reflecting the ongoing needs of the population 
for new fields of study, continuing education, 
and new arrangements. 

Digital information is fragile and we do not 
have the luxury of letting time take its course. 
The self-organizing archives already identified 
in this report demonstrate that investigators 
will establish and utilize these types of collec- 
tions but the patchwork of archives and data 
centers, many of them supported entirely by 
volunteer efforts, is not sufficient in many dis- 
ciplines to sustain the data over decades with 
enough confidence to inspire widespread use. 
For any system or set of systems to be successful,
the users must trust it. The prestige of the
NSF is a good start but not sufficient in and of 
itself. Potential users must see the value in the 
system and the system must function; it must 
be able to provide services reliably. Within the 
framework of a national system of data centers 
called for by the NSF, workshop participants 
outlined the challenge, asking a very simple 
question: "What does it take to get there?" 

Summary of Plenary Discussions: Bridging Cultures 
and Creating Incentives 

The discussions and recommendations set forth 
by the breakout groups converged, albeit from
their respective perspectives. Nearly all of the
groups called for a system of data curation and/or
stewardship facilities reflecting new kinds
of partnerships. These should be approached 
experimentally at first through prototypes and 
temporary arrangements to study successful 
models and abstract lessons learned and best 
practices. There was also wide consensus for 

• cross-disciplinary research and the techni- 
cal capabilities and tools to support long- 
term preservation of the data; 

• appropriate access monitoring and access 
controls to protect confidentiality and per- 
sonal privacy as well as system security and 
risk mitigation more generally; 

• automatic creation of metadata; and 

• interoperability over heterogeneous data 
and systems.

In addition, the need for current and longer-term
research that would bring to bear the expertise
of allied disciplines was identified. For
example, this could include a focus on business 
and economics issues relating to some of the 
problems raised by the establishment of these 
data facilities. Thus, any facility, operational or 
prototyped, should be sufficiently flexible to 
accommodate changes in technology and or- 
ganization, including hand-offs to successive 
operators in a relay structure. 

Workshop participants also agreed that 
education and outreach to scientists, librarians, 
and the public on the topic of data stewardship 
was vital. This would include curricula to train 
a new kind of information professional, build- 
ing on the traditions of librarians and archivists, 
as well as strategies to educate scientists on the 
value of digital curation and the possibilities 
for research that such data may offer in those 
domains where reuse of data is not common. 

Finally, all of the groups acknowledged the 
challenge of communicating and working across 
disciplinary and institutional cultures and the 
burden that placed on developing appropri- 
ate incentives for individuals and management 
policies for institutions and collections. This 
theme echoed through the position papers and 
plenary sessions as well as in the group discus- 
sions. The way forward is inevitably through 
a mix of cross-institutional, cross-disciplinary 
structures that can take multiple forms and can 
range from the national and international or- 
ganizations such as the Protein Data Bank and 
similar data sets to smaller regional centers that 
may be co-located with existing centers to spe- 
cialized collections housed on individual cam- 
puses or within existing museums and libraries 
that serve a well-defined community. One such 
example is the Field Museum with a specialty 
in anthropology. Incentivizing participation is 
key and is complementary to the call for data 
stewardship facilities, since the existence of the 
facility provides opportunities for scientists to
deposit their collections as well as experiment
with the type of data such facilities offer. At 
the same time, such participation justifies the 
facilities and creates demands for curatorial 
services. 

Two dimensions of the discussions about 
cross-cultural challenges and incentives stand
out: one concerns the culture within the NSF, 
and the second the culture in the research en- 
terprise as it affects individual researchers and 
mechanisms for motivating their participation 
in digital data stewardship. In summary com- 
ments, it was noted that digital data stewardship 
challenges the NSF culture as a basic research 
agency. Specifically, projects that represent ap- 
plied research, which may be highly relevant to 
the research required to build prototype data 
curation and/or stewardship facilities, can be 
difficult to fund through the standard review 
process, which reflects the values associated 
with basic research. It will be important for the 
NSF to address this concern. 

From the academics' perspective, systems 
of prestige, embodied in promotion and tenure 
review, may act as disincentives to participa- 
tion in long-term curation by tacitly encourag- 
ing investigators to retain control of their data 
and by discouraging them from allocating their 
time, a very scarce resource, to even minimal 
processing of the data. While the scarcity of 
time together with the challenges of extensive 
metadata argue for tools to assist researchers as 
well as for automatic metadata creation, NSF 
could also facilitate change by coming up with 
ways to recognize creation of the data sets and
to ascribe authorship in some manner. For
example, deposit into a certified institution 
might count toward qualifying publications in 
a proposal submission or data curation might 
become one of the science and technology indicators
that NSF compiles. Thus, data management
and curation at the individual level become
effectively congruent with existing prestige
systems.

A number of ideas were identified to advance
the NSF process of funding and reporting
that could provide incentives to researchers, a 
stick as well as a carrot. Several draft recom- 
mendations called for attaching a requirement 
for a data management plan to the budget jus- 
tification, which is an element of NSF submis- 
sions, and to the reporting required at the con- 
clusion. The latter had the additional advan- 
tage of providing a corpus of information that 
NSF might use to understand patterns of data 
creation, deposit, and reuse. 

The concept of data management plans is 
not new to NSF. The National Science Board report,
"Long-Lived Data Collections: Enabling Research
and Education in the 21st Century," stated,
"individual or teams of researchers who will 
author and curate the data... need to have a 
strategy for dealing with data from their incep- 
tion to their demise." The NSB recommenda- 
tion provides further detail and guidance to 
the current NSF requirements as presented in 
the Grant Proposal Guide (NSF-04-2). While 
participants agreed with the strategy requir- 
ing a data management plan, a few concerns 
were voiced. Some participants suggested that 
requiring proposers to submit a data manage- 
ment plan, however brief, was an additional 
burden on the investigators' already stretched 
resources. An opposing viewpoint was that de- 
veloping and elucidating a plan for valuable 
research data is part of responsible scholarship, 
but without the mandate that a data manage- 
ment plan be included in the proposal submis- 
sion (in the same way that a budget justifica- 
tion or previous work is included), the culture 
around stewardship and preservation would 
be less likely to change. 

Recommendations 

The recommendations of each breakout group 
have been presented in the context of descriptions
of the discussions that took place within
those groups. Those recommendations have
been refined based on plenary discussions and 
by eliminating redundancy. This process resulted
in one overarching recommendation that
reflects the consensus of the workshop participants.
In addition, three general recommendations
emerged from the group discussions that
amplify the overarching recommendation. Fi- 
nally, six targeted recommendations build on 
the more general recommendations. 

Overarching Recommendation 

NSF should facilitate the establishment of a 
sustainable framework for long-term steward- 
ship of digital data. This framework should in- 
volve multiple stakeholders by: 

• supporting the research and development re- 
quired to understand, model, and prototype the 
technical and organizational capacities needed 
for digital data stewardship, including strate- 
gies for long-term sustainability, and at mul- 
tiple scales; 

• supporting training and educational pro- 
grams to develop a new workforce in data sci- 
ence both within NSF and in cooperation with 
other agencies; and 

• developing, supporting, and promoting educa- 
tional efforts to effect change in the research 
enterprise regarding the importance of the 
stewardship of digital data produced by all scientific
and engineering disciplines/domains.

This overarching recommendation recognizes 
the simultaneous and mutually dependent 
need for both capacity (facilities and resources) 
and motivation, for supply and demand. It also 
recognizes that the capacities do not yet fully 
exist (although there are many examples of approaches
within many disciplines) and require
additional prototyping and research into relevant
technical, organizational, and behavioral issues, as
well as economic and business issues. It was
recognized that substantial effort in outreach 
and education is required to create an environ- 
ment and mindset conducive to curation, pres- 
ervation, and in some cases, stewardship of 
digital data. These efforts must be undertaken within
NSF and other agencies, as well as within the
cultures of the respective disciplines and organizations,
including professional societies;
libraries, archives, and other heritage institutions;
publishers; and universities. Finally, just
as the stewardship of the data requires cross- 
disciplinary collaborations, so too must the 
responsibilities for effecting these goals tran- 
scend the mandate of the NSF. Hence this rec- 
ommendation includes an interagency element 
not unlike the Digital Libraries Initiative, which 
assembled resources from multiple agencies in 
an integrated research program that pursued 
shared goals. 

Three general recommendations emerged 
from the three groups along the following 
themes: 

1. Fund projects that address issues concerning
ingest, archiving, and reuse of digital
data by multiple communities. Promote
collaboration and "intersections" among a
variety of stakeholders, including research
and academic libraries, scholarly societies,
commercial partners, science, engineering,
and research domains, and evolving information
technologies and institutions.

2. Foster the training and development of a
new workforce in data science. This could
include support for new initiatives to
train information scientists, library professionals,
scientists, and engineers to work
knowledgeably on data stewardship projects.



3. Support the development of usable and useful
tools, including:

• automated services which facilitate under- 
standing and manipulating digital data; 

• digital data registration; 

• reference tools to accommodate ongoing 
evolution of commonly used terms and con- 
cepts; 

• automated metadata creation; and 

• rights management and other access control 
considerations. 

These general recommendations and themes 
are amplified by the following targeted recom- 
mendations. 

1. NSF should develop a program to fund projects/case
studies for digital data stewardship
and preservation in science and engineering.
Funded awards should involve collaborations
among research and academic libraries, scientific/research
domains, extant technology bases,
and other partners. Multiple projects should
be funded to experiment with different models.

NSF should facilitate the establishment of a sus- 
tainable framework for long-term stewardship 
of federally funded research in science and en- 
gineering. To this end, funded projects should 
include multiple partners from key stakeholder
communities, including universities, academic
and research libraries, domain specialists, com- 
puter scientists, professional societies, stan- 
dard-setting bodies, publishers, commercial 
and not-for-profit vendors, and funding agen- 
cies. The inclusion of multiple partners and 
interests is reflected in the statement, "it takes 
a research community to preserve its data." 
Given the scale of the challenges, the projects 
should reflect the complexities of the steward- 
ship of digital data within the research enter- 
prise, the distributed nature of use, diverse re- 
sponsibilities, and varied partnership interests 
and needs.

The challenges of cross-disciplinary collaboration
are evident in a number of places, notably
in the heterogeneity of the data sets and the
cultural, organizational and technical frame- 
works within which the data are collected and 
recorded. This creates a very practical problem 
for curatorial facilities and their staffs. There 
was consensus that distributed approaches are 
required. Yet effective distributed organiza- 
tions and technical architectures require shared 
tools, standards, protocols and procedures to 
enable efficient and interoperable systems and 
collaborations. These needs are particularly ob- 
vious with respect to the processes associated 
with ingest (submission), archiving (including 
storage and management), and reuse (retrieval 
of relevant information in a form useful to the 
future investigator). There has been work in 
various elements (for example, metadata and 
ontologies); less well-understood is the flow 
within organizations and across organizations, 
particularly where interdisciplinary collabora- 
tions are desired. Thus, prototyping these flows 
enables researchers to identify what works and 
what does not, where the hand-offs occur, and 
how to improve both the interfaces and the 
tools that support specific steps. 

2. NSF, with other federal agencies such as the
Institute of Museum and Library Services and
with schools of library and information science, should
support training initiatives to ensure that information
and library professionals and scientists
can work more credibly and knowledgeably on
digital data stewardship (data curation, management,
and preservation) as members of research
teams.

It is widely acknowledged that a new kind of 
professional is required whose expertise is criti- 
cal to the successful stewardship of digital data 
resources. The digital environment requires 
new tools and skill sets. For example, scientists
are typically not trained to manage the data
sets, so there is an equivalent need within the 
disciplines to extend the current methodolo- 
gies that are focused on collecting and analyz- 
ing data to include management of the data. 
Data management affects not only disposition 
after research is concluded but, with potential 
reuse of data, also requires that the scientific 
user more fully understand how data from an 
archive may have been stored and managed. It 
is the equivalent of understanding the potential 
effect of the instrumentation on the observa- 
tion. The social sciences are already attuned to 
examining bias in data; extending that notion 
to understanding the implications for manage- 
ment of data is the logical next step. Thus, both 
the investigators and the data managers have 
a shared interest in managing data more ef- 
fectively. Hence there is a broad need in many 
communities to understand data management, 
either as a future manager or a future consumer 
of the information. 

3. NSF should support the development of usable
and useful tools and automated services (e.g.,
metadata creation, capture, and validation)
that make it easier to understand and manipulate
digital data. Incentives should be developed
that encourage community use.

Recommendation 1 calls for prototypes and 
models of the entire flow of information in the 
science and engineering arena. This recommen- 
dation builds on it by calling attention to sev- 
eral specific research areas that are subsumed 
into the flows described earlier. All of these are 
well-known problems in digital information 
management, and although there have been 
numerous research projects, the problems are 
far from solved. Moreover, they have not been 
solved with reference to integrated solutions 
that take into account distributed organizations 
and multiple communities. Thus, projects funded
under this recommendation would complement
work undertaken in the earlier recommendation
and might attack a specific problem
directly. This might include data registration,
automatic metadata creation, reference tools to
accommodate ongoing evolution of commonly
used terms and concepts, and rights management.

4. Economic and social science experts should be 
involved in developing economic models for sus- 
tainable digital data stewardship. Research in 
these areas should ultimately generate models 
which could be tested in practice in a diversity 
of scientific/research domains over a reasonable 
period of time in multiple projects. 

Technical and economic sustainability were 
identified as critically important issues. Both are 
essential to engendering trust, which is a nec- 
essary precondition to viable curation facilities 
and to the larger framework of data stewardship
of which each facility is a part. A number
of approaches were discussed, combining local,
regional, and national/international scales
and disciplinary and cross-disciplinary content.
However, it was generally agreed that develop- 
ing good examples requires studying relevant 
organizational and behavioral issues as well as 
leveraging the research that has already been 
done in information economics, public goods, 
and infrastructure investments. Some of the
topics that might be addressed include valuing
intangibles; modeling collaboration; examining
value propositions under various assumptions;
motivations, incentives, and prestige systems;
and engendering trust.

5. NSF should require the inclusion of data man- 
agement plans in the proposal submission pro- 
cess and place greater emphasis on the suitabil- 
ity of such plans in the proposal's review. A data 
management plan should identify if the data are 
of broader interest; if there are constraints on
potential distribution, and if so, the nature of
the constraint; and, if relevant, the mechanisms 
for distribution, life cycle support, and preser- 
vation. Reporting on data management should 
be included in interim and final reports on NSF 
awards. Appropriate training vehicles and tools 
should be provided to ensure that the research 
community can develop and implement data 
management plans effectively. 

6. NSF should encourage the development of data 
sharing policies for programs involving com- 
munity data. Discussion of mechanisms for de- 
veloping such plans could be included as part of 
a proposal's data management plan. In addition, 
NSF should strive to ensure that all data shar- 
ing policies be available and accessible to the 
public. 

These recommendations seek to leverage 
the NSF processes to raise awareness of data 
curation in the research community, to acknowl- 
edge the need for the services of data curation 
and/or stewardship facilities, and to motivate 
researchers to participate through the require- 
ments of the funding and reporting processes. 
Several key elements integral to a data manage- 
ment plan were identified. These include: 

• how the data will be managed, by whom, 
and by what mechanism; 

• whether data will be shared with the com- 
munity (and if not, why not); 

• whether the data will be preserved for the 
future, and if so, how; and 

• how the data management (specifying preservation,
curation, and/or stewardship)
will be supported during the award period
and afterward, as appropriate.

The information recovered through the report- 
ing processes could be useful to the Foundation 
as it evolves its policies on data management. 
Importantly, NSF should support training initiatives
to ensure that the research community
can fulfill this requirement while providing
sufficient incentives to the community to ensure
compliance. At the same time, it was acknowledged
that there is a potential for such
requirements to disadvantage applicants from
less well-resourced universities. Thus, such requirements
should be coupled with programs
similar to those already in place, such as EPSCoR,
that seek to redress such imbalances.



The NSF has an opportunity to model inter- 
nal business processes that other funding agen- 
cies might adopt. NSF has long been a leader 
among the basic research agencies. The workshop
participants urged the Foundation to yet
again show leadership in the data management
arena. As the Foundation has already recognized,
data stewardship is fundamental to science and engineering
research and the education enterprise
in the digital age.
knowledge, and Section 7: Conclusions.

Towards 2020 Science (2006). Microsoft Corporation, http://research.microsoft.com/towards2020science/downloads/T2020S_ReportA4.pdf
RECOMMEND: Summary page 8, Section 1: Laying the Ground pages 14-20, Section 4: Conclusions and
Recommendations pages 70-74.

Secondary Readings in General/Overview 

Newman, H. B., Ellisman, M. H., & Orcutt, J. A. (2003). Data-intensive e-science frontier 
research. Communications of the ACM, 46(11), 68-77. 

Vinge, V. (2006). 2020 Computing: The creativity machine. Nature, 440(7083), 411. 

The Grid and the Semantic Web 

De Roure, D., & Hendler, J. A. (2004). E-Science: the grid and the Semantic Web. IEEE 
Intelligent Systems, 19(1), 65-71. 

Berman, F., Fox, G., & Hey, T. (Eds.). (2003). Grid Computing: Making the Global
Infrastructure a Reality (1st ed.). John Wiley and Sons, Ltd, England.

Gagliardi, F. (2005). The EGEE European grid infrastructure project. High Performance
Computing for Computational Science - VECPAR 2004, 3402, 194-203.

Gagliardi, F., & Begin, M. (2005). EGEE - providing a production quality grid for e-science.
Local to Global Data Interoperability - Challenges and Technologies, 20-24 June 2005,
88-92.

Hendler, J. (2003). Science and the Semantic Web. Science, 299(5606), 520.

Cyberinfrastructure 

Hey, T., & Trefethen, A. E. (2005). Cyberinfrastructure for e-Science. Science,
308(5723), 817-821.

Hey, T., & Trefethen, A. E. The Data Deluge: An E-Science Perspective.
http://www.rcuk.ac.uk/escience/documents/report_datadeluge.pdf

Aimes, G., Birnholtz, J. P., Hey, T., Cummings, J., Foster, I., & Spencer, B. (2004). CSCW and
cyberinfrastructure: opportunities and challenges. Computer Supported Cooperative
Work Conference Proceedings, 6-10 Nov. 2004, 270-273.

Berman, F., Berman, J., Pancake, C., & Wu, L. A Process-Oriented Approach to
Engineering Cyberinfrastructure. http://director.sdsc.edu/pubs/ENG/report/EAC_CI_Report-FINAL.pdf.

Moore, R., Berman, F., Schottlaender, B., Rajasekar, A., Middleton, D., & JaJa, J. Chronopolis:
Federated Digital Preservation Across Time and Space. IEEECS International
Symposium on Global Data Interoperability Challenges and Technologies, June 2005.

National Science Foundation Cyberinfrastructure Council. (2006). NSF's Cyberinfrastructure
Vision for the 21st Century Discovery. National Science Foundation, http://www.nsf.gov/od/oci/CI-v40.pdf.

National Science Foundation. Revolutionizing Science and Engineering Through
Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon
Advisory Panel on Cyberinfrastructure (the "Atkins Report"). http://www.nsf.gov/cise/sci/reports/atkins.pdf.

National Science Board. Long-Lived Data Collections: Enabling Research and Education in the
21st Century. National Science Board, http://www.nsf.gov/pubs/2005/nsb0540/start.jsp.

National Science Foundation 

Over the past three years, a number of reports and papers on cyberinfrastructure and its 
impact on research and education have been compiled. Links to a sample of some of the
reports and papers are listed below. 

Building a Cyberinfrastructure for the Biological Sciences; workshop held July 14-15,
2003. http://research.calit2.net/cibio/archived/CIBIO_FINAL.pdf
http://research.calit2.net/cibio/report.htm.

CHE Cyber Chemistry Workshop; workshop held October 3-5, 2004.
http://bioeng.berkeley.edu/faculty/cyber_workshop.

Commission on Cyberinfrastructure for the Humanities and Social Sciences;
sponsored by the American Council of Learned Societies; seven public information-gathering
events held in 2004; report in preparation.
http://www.acls.org/cyberinfrastructure/cyber.htm.

Computation as a Tool for Discovery in Physics; report by the Steering
Committee on Computational Physics.
http://www.nsf.gov/pubs/2002/nsf02176/start.htm.

Cyberinfrastructure for the Atmospheric Sciences in the 21st Century; workshop held
June 2004. http://netstats.ucar.edu/cyrdas/report/cyrdas_report_final.pdf.

Cyberinfrastructure for Engineering Design; workshop held February 28-March 1, 2005;
report in preparation.

Cyberinfrastructure and the Next Wave of Collaboration, D. E. Atkins, Keynote
for EDUCAUSE Australasia, Auckland, New Zealand, April 5-8, 2005.

Cyberinfrastructure for Engineering Research and Education; workshop held June 5-6,
2003. http://www.nsf.gov/eng/general/Workshop/cyberinfrastructure/index.jsp.

Cyberinfrastructure for Environmental Research and Education (2003); workshop
held October 30-November 1, 2002.
http://www.ncar.ucar.edu/cyber/cyberreport.pdf.

Cyberinfrastructure (CI) for the Integrated Solid Earth Sciences (ISES) (June
2003); workshop held on March 28-29, 2003.
http://tectonics.geo.ku.edu/ises-ci/reports/ISES-CI_backup.pdf.

Cyberinfrastructure and the Social Sciences (2005); workshop held March 15-17,
2005.
http://www.sdsc.edu/sbe/.
Cyberlearning Workshop Series; workshops held Fall 2004-Spring 2005 by the
Computing Research Association (CRA) and the International Society of the
Learning Sciences (ISLS).
http://www.cra.org/Activities/workshops/cyberlearning.

Final Report: NSF SBE-CISE Workshop on Cyberinfrastructure and the Social
Sciences, F. Berman and H. Brady.
http://vis.sdsc.edu/sbe/reports/SBE-CISE-FINAL.pdf.

Geoinformatics: Building Cyberinfrastructure for the Earth Sciences (2004);
workshop held May 14-15, 2003; Kansas Geological Survey Report 2004-48.
http://www.geoinformatics.info/.

Geoscience Education and Cyberinfrastructure, Digital Library for Earth System
Education (2004); workshop held April 19-20, 2004.
http://www.dlese.org/documents/reports/GeoEd-CI.pdf.

Identifying Major Scientific Challenges in the Mathematical and Physical
Sciences and their Cyberinfrastructure Needs; workshop held April 21, 2004.
http://www.nsf.gov/attachments/100811/public/CyberscienceFinal4.pdf.

Materials Research Cyberscience enabled by Cyberinfrastructure; workshop held
June 17-19, 2004.
http://www.nsf.gov/mps/dmr/csci.pdf.

Multiscale Mathematics Initiative: A Roadmap; workshops held May 3-5, July
20-22, September 21-23, 2004.
http://www.sc.doe.gov/ascr/mics/amr/Multiscale%20Math%20Workshop%203%20%20Report%20latest%20edition.pdf.

An Operations Cyberinfrastructure: Using Cyberinfrastructure and Operations
Research to Improve Productivity in American Enterprises; workshop held
August 30-31, 2004.
http://www.optimization-online.org/00/OCI.doc.
http://www.optimization-online.org/00/OCI.pdf.

Planning for Cyberinfrastructure Software (2005); workshop held October 5-6,
2004.
http://www.nsf.gov/od/oci/ci_workshop/index.jsp.

Preparing for the Revolution: Information Technology and the Future of the
Research University (2002); NRC Policy and Global Affairs, 80 pages.
http://www.nap.edu/catalog/10545.html.
Polar Science and Advanced Networking; workshop held on April 24-26, 2003;
sponsored by OPP/CISE.
http://www.polar.umcs.maine.edu/.

Research Opportunities in Cyberengineering/Cyberinfrastructure; workshop held
April 22-23, 2004.
http://129.25.60.81/%7Eworkshop/.

Revolutionizing Science and Engineering Through Cyberinfrastructure: report of
the National Science Foundation Blue-Ribbon Advisory Panel on
Cyberinfrastructure; Daniel E. Atkins (Chair), January 2003.
http://www.nsf.gov/od/oci/reports/atkins.pdf.

A Science-Based Case for Large-Scale Simulation; workshop held June 24-25,
2003.
http://www.pnl.gov/scales/docs/volume1_72dpi.pdf.
http://www.pnl.gov/scales/docs/SCaLeS_v2_draft_toc.pdf.

Summit on Digital Tools for the Humanities; workshop held September 28-30,
2005.
http://www.iath.virginia.edu/dtsummit/SummitText.pdf.

Supplement to the President's Budget for FY 2006; Report by the Subcommittee
on Networking and Information Technology Research and Development
(NITRD), February 2005.
http://www.nitrd.gov/.

Trends in IT Infrastructure in the Ocean Sciences (2004); workshop held May 21-23, 2003.
http://www.geo-prose.com/oceans_iti_trends/oceans_iti_trends_rpt.pdf.



Data 

Aimes, G., Birnholtz, J. P., Hey, T., Cummings, J., Foster, I., & Spencer, B. (2004). CSCW and
cyberinfrastructure: opportunities and challenges. Computer Supported Cooperative
Work Conference Proceedings, 6-10 Nov. 2004, 270-273.

Choudhury, S., et al. (2006). Digital data preservation in astronomy. 2nd International Digital
Curation Conference: Digital Data Curation in Practice, Glasgow, UK, November 21-22,
2006.

Humphrey, C. (2004). Preserving research data: a time for action. Symposium of the Canadian
Conservation Institute on the Preservation of Electronic Records: New Knowledge and
Decision-making, 83-90.

Humphrey, C., & Jacobs, J. (2004). Preserving research data. Communications of the ACM,
47(9), 27-29.

Muggleton, S. H. (2006). 2020 Computing: Exceeding human limits. Nature, 440(7083),
409-410.

Szalay, A., & Gray, J. (2006). 2020 Computing: Science in an exponential world. Nature,
440(7083), 413-414.

Rusbridge, C., & McHugh, A. (2005). Saving for the nation [Electronic version]. Information
Scotland, 3(4). See http://www.slainte.org.uk/publications/serials/infoscot/vol3(4)/vol3(4)article6.htm.

Virtual Organizations (secondary reading)

Camarinha-Matos, L. M. (2003). Infrastructures for virtual organizations - where we are. 2003
IEEE Conference on Emerging Technologies and Factory Automation. Proceedings,
16-19 Sept. 2003, vol. 2, 405-414. RECOMMEND: Section V: Support for Remote
Operation and E-Science, page 412.

Conference: Data Curation

1st Digital Curation Conference: An overview of the 1st Digital Curation Conference can be
found online at http://www.ariadne.ac.uk/issue45/dcc-1st-rpt/.



Compiled by Kristi Jenkins, Physics and Astronomy Librarian, University of Minnesota. 
11:00 a.m.-11:30 a.m.

Report of the Task Force on Archiving of Digital Information
Commission on Preservation and Access and the Research Libraries Group

User-centric, Multilevel, Nimble, Sustainable, Reliable

Report of the Task Force on Archiving of Digital Information, 1996






Digital Data is Fundamental to 21st Century
Research and Education

Today’s Applications Cover the Spectrum

Public Data Collections Hosted in SDSC’s DataCentral

The “Data Pyramid”: An Organizational Structure
for Talking about Research Data

Opportunities for Partnerships throughout the
“Data Pyramid”

Providing Sustainable and Reliable Data
Infrastructure incurs Real Costs












86 • To Stand the Test of Time 









Long-term Stewardship of Digital Data Sets in Science and Engineering • 87 



















Data/Information in the VO

• Basic data
- digital images, spectra, time series, catalogs, tables
not discoverable

The Virtual Observatory in Astronomy

• The Virtual Observatory enables new science by greatly enhancing access to data and computing
resources. The VO makes it easy to locate, retrieve,

26-27 Sept 2006 ARL/NSF Workshop























Data integration

Cas A supernova remnant










Data integration

The Key to the VO: Interoperability

• Metadata standards 

• Data discovery 
have maximum return on investment. Essential legacy datasets are being lost.

experience in the university libraries and build on long-term institutional commitments to preservation.
Commit to collection: contract the manager!
Produce full look and feel as ingested? Expensive!

• May also be unfamiliar for future consumer

Somewhere between?

• Depends on goals...

• Academic credit (citations?)

• Free-loaders (embargos?)

• Disciplines are different!

• Workforce skills: researcher, data librarian/scientist
“There’s plenty of money... there’s just not
plenty of money for everything!” (Courant) 
Workshop Participants' Position Papers: Overview and
Charge to Breakout Groups

Three questions:
Technical infrastructure and tools
R & D (including prototypes) 

Science education and communication/outreach 
(scholarly and public, including professional training) 
> That it becomes a platform for great
things 




Appendix D. Breakout Session Reports 




1. Infrastructure Breakout 

Wendy Pradt Lougee and Richard Luce 

2. The Role of Academic Libraries in the Digital Data Universe 

Bob Hanisch and Brian Schottlaender 



3. Summary of Economic Sustainability Models Breakout Session 

Chris Rusbridge and Fran Berman 
The role of the library in the curatorial function, e.g., governance,
standards, collection development, privacy, etc.

Imperative of stakeholder representation 





Structuring the Problem OAIS Reference Model 
http://public.ccsds.org/publications/archive/650x0b1.pdf
Cross-disciplinary discovery
Automatic generation of metadata 



The Role of Academic Libraries 
(i.e. what is the newsworthy item that will get U.S. competitive juices flowing and help generate new funding for data management and preservation)

Good infrastructure must have a plan for “the end” - how do we reappraise if necessary, how do we hand-off, how do we become self-sufficient?




Eric’s Updated Version of the Cliff Lynch model

“Actionable” Recommendations 1

“We don’t get anywhere if we don’t start somewhere.”
Use NSF program process to help the library
community take more responsibility for the 

stewardship of research data (with other funders?) 













A Bolder Vision? Remember DLI!
Courant already doing?
after 10 years?
Appendix E. Position Papers




1. Henry E. Brady 

2. Suzanne Carbotte 

3. Robert S. Chen 

4. Sayeed Choudhury, Robert Hanisch, and Alex Szalay 

5. Paul Constantine 

6. Peter Cornillon 

7. Bernard Dumouchel and Richard Akerman 

8. Stephanie Hampton, M. Jones, and M. Schildhauer 

9. Margaret Hedstrom 

10. Charles Humphrey 

11. John Leslie King 

12. Rick Luce 

13. Barbara Lust and Janey McCue

14. James L. Mullins 

15. James D. Myers 

16. Frank Rack 

17. Mark Sandler 

18. Mackenzie Smith 

19. Eric F. Van de Velde 

20. Todd Vision 

21. Tyler O. Walters
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Comments by Henry E. Brady, Professor of Political Science and Public Policy, Director of the 
Survey Research Center and UC DATA, University of California, Berkeley 

There are many problems confronting efforts to preserve and manage scientific data in digital 
form including making decisions about what to keep, developing plans for discarding data 
when it is no longer useful, providing adequate meta-data, ensuring long term preservation 
given frequent changes in media and software, and finding and training staff to do these tasks. 

I will focus, however, on three problems that are especially pertinent to the social sciences. 

Linkage of Data Sets. The social sciences are benefiting enormously from the easy availability 
of large-scale, computerized datasets such as vital statistics, census data, employment records, 
educational data, welfare and social security records, voting data, medical records, commuting 
and transportation data, and many other kinds of information. These datasets are even more 
useful when they can be easily linked together to study events or transitions such as the 
transition from welfare to work, from illness to a job, from school to citizen, from prison to 
everyday life, or from home to work. Coding of geographic, contextual, genetic, environmental, 
and other information can make these data even more valuable for understanding the impact 
of neighborhoods, institutions, physical distance, individual characteristics, and other factors. 
Yet these data come in many different forms (different types of databases, different units of 
observation, and various levels of reliability), and linking them poses significant challenges. If 
data libraries are to be truly useful for social sciences, they must provide users with the software 
tools to link these very large and unwieldy datasets easily, reproducibly, and reliably.

Confidentiality. Although the availability and the linkage of social science data provide 
tremendous opportunities for answering important social science questions, they also 
exponentially increase the dangers of disclosing personal information through the possibility 
of identifying individuals - even though social science researchers are almost always interested 
in general statements about behavior and almost never interested in individuals. The problem 
of confidentiality has increased the requirements for Human Subject Reviews, decreased the 
availability of many kinds of data, and made linkage especially suspect. A number of technical 
and institutional methods are being developed to deal with these problems, but we are still far 
from having generally accepted approaches to them. Moreover, although confidentiality has 
been an especially difficult problem for the social sciences, it is increasingly a problem for the 
medical, environmental, and even the geo-sciences. 

Institutional Models. One answer to the problems of linkage and confidentiality is to develop 
better institutional models that give researchers access to data in ways
that protect individuals while allowing for extensive data linkage. One example is the Census
Research Data Centers, which allow researchers to access non-public Census data under rigidly
controlled conditions. This model, however, only allows for access on a case-by-case basis,
and it does not currently allow for long-term access to data. Institutions are also important
for a larger reason: At the moment, we have nothing comparable to the "University Library" 
which has historically made rational acquisitions through "collection specialists" working 
with researchers, developed meta-data through classification and indexing, and paid for the 
development of documentation and the preservation of information. Thus, researchers with 
data typically do not have any place to go on the University campus, and even if there is an 
institution concerned with digital social science data, it is typically woefully under-funded and 
unable to help the researcher archive and preserve data. Some libraries and some computer 
centers have begun to take up these challenges, but each has other responsibilities and agendas 
which impede their efforts. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Comments by Suzanne Carbotte, Program Director, Marine Geoscience Data System, Lamont- 
Doherty Earth Observatory, Columbia University 

How do we ensure the engagement of data producers at various stages in the data life cycle?

To develop comprehensive digital data resources of maximum use to the research and education 
community requires the active involvement and engagement of individual scientists who are 
essential data producers throughout the data life cycle. A diverse range of individual scientists 
may be involved at various stages throughout the data life cycle as field data are processed,
reprocessed for new applications, and integrated into data syntheses, and as higher-level derived data
products are developed. It must be easy (transparent?) for scientists to document and contribute
their data products; scientists need to have their contributions adequately protected and
acknowledged; and new rewards for contribution are needed.

Inadequate enforcement of data policies. New governance structures for enforcement of 
data policies are needed. In some realms of scientific research, existing NSF data policies 
have been difficult to enforce, partly because appropriate digital data repositories have not 
existed, but also because mechanisms to document compliance are not in place. Another 
aspect of the problem is that the scientist may be the only one who knows of the existence of
the data, and compliance of the individual scientist with a data policy must be based on trust and
commitment to data preservation as part of the scientific process.

Inadequate incentives for scientist participation in data preservation. New incentives for 
scientists to contribute to data collections go hand in hand with the need to fully engage data 
producers in the data preservation process and the need for new structures for enforcing 
data policies. We need to change how we reward and credit scientists for data contribution. 
Contribution to databases needs to be part of the publication process, and we need a new 
system of professional recognition that acknowledges the value of data production and 
contribution to data collections. New partnerships with the academic journals will be needed to 
develop policies for publication, which include linking publications to digital data resources. 

How do we ensure the long-term security of digital data collections in an uncertain funding 
climate? New scenarios and partnerships to ensure long-term funding are essential for both the 
development and security of digital data collections. To adequately manage and preserve the 
complex heterogeneous data that are produced in an increasingly multidisciplinary research 
environment requires data managers with a high level of expertise in both the domain sciences
and information technology. Such people are difficult to find and keep in an uncertain
and short-term funding climate.
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Format 

Comments by Dr. Robert S. Chen, Interim Director, CIESIN at Columbia University; Manager 
of the NASA Socioeconomic Data and Applications Center (SEDAC); and Secretary General, 
Committee on Data for Science and Technology (CODATA) of the International Council for 
Science (ICSU) 

Digital Rights Management. Clarification of rights to archive, use, and disseminate data 
and any applicable restrictions is essential to long-term data curation, and would be greatly 
facilitated by digital standards and technologies that permit data sources and users to quickly 
and easily specify and understand rights and restrictions in ways that meet their needs and 
concerns. This has been a key element in the success of the Creative Commons and needs active 
support in the realm of scientific data and information. Key issues that need to be addressed 
include protection of confidentiality, limitations on liability of data sources, use of data for 
humanitarian purposes, and definitions of appropriate uses (e.g., private sector vs. public sector 
research). 

New Institutional Partnerships. A range of new partnerships is needed across disciplines, 
within and between universities, between sponsors and data managers, across the public 
and private sectors, and with the broader scientific community to establish appropriate and 
sustainable long-term data management structures covering all or most of science. Is a mix of 
disciplinary, cross-disciplinary, and institutional repositories going to evolve that can provide 
sustainable data curation and management in most fields? Can we identify gaps and find ways 
to fill them? Are there contingency plans when a particular institutional arrangement for a 
particular field encounters problems with sustainability? Are there ways to involve the private 
sector and/or the open source community in these arrangements to help infuse interoperable
technologies and reduce costs without risking long-term sustainability or access? Should 
existing consortia, e.g., of universities, of libraries, or disciplinary data centers, be asked to take 
on long-term curation responsibilities or are new ones needed? 

Science Education and Community Outreach. Many scientists continue to use traditional 
approaches to data, i.e., developing custom datasets for their own use with little attention 
to long-term reuse, dissemination, and curation. Although there has been considerable 
progress in data stewardship for "big science" projects, even modest collaborative projects are 
inconsistent in their attention to data management and few individual scientists think beyond 
posting selected results and data on the Internet or submitting a final data product to a data 
archive if required to do so. Changing this sort of behavior will require a range of efforts, 
including investment in approaches to make data documentation, sharing, and preservation 
easier, establishment of an infrastructure to accept and assume responsibility for data (e.g., a 
local university depository or a disciplinary data center), and, perhaps most important of all, 
concerted efforts to educate current and future scientists to adopt better practices. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Comments by Sayeed Choudhury, Associate Director for Library Digital Programs, Johns 
Hopkins University; Robert Hanisch, Senior Scientist, Space Telescope Science Institute; and 
Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, Johns 
Hopkins University 

The National Science Foundation's Cyberinfrastructure Vision for 21st Century Discovery 
describes a broad-based effort that is transforming the manner in which scientists, social 
scientists and engineers (and perhaps even the humanists) conduct research, teach and 
learn, and disseminate their research findings and publications. Projects such as the Virtual 
Observatory (VO) provide ample evidence of data-driven scholarship, which offers both 
challenges and opportunities for academic research libraries, especially in the realm of data 
curation. Even with ambitious new efforts to create large corpora of digitized text such as 
Google Book Search or the Open Content Alliance, libraries still represent a core element of the 
preservation picture. 

Given the scale and complexity of even a single cyberinfrastructure-based project such as the 
VO, it is not reasonable to assume that a single library or organization can manage the entire 
range of data curation needs. Rather, libraries must find ways to work together with an array of 
organizations such as other libraries, supercomputing centers, museums, archives, publishers, 
and corporations. For different stages and applications of data, various organizations will 
need to identify appropriate roles and develop systems that interface — technically and 
organizationally — with a range of partner institutions. Such a complex array of relationships 
and technological infrastructure may benefit from examination and leadership from the highest 
levels of the academic and corporate community. However, at this stage, there are major 
research and development questions that remain unaddressed. In this current environment, 
it's essential to develop prototype systems that demonstrate both technical and organizational 
infrastructure to support data curation. These prototype development efforts will help us better 
understand appropriate technologies, potential costs, and organizational relationships that will 
be necessary to support cyberinfrastructure-based projects and programs. 

At Johns Hopkins University, we are working with a network of libraries, publishers, scholarly 
societies, and corporate partners to develop a repository-based system that will support an 
end-to-end process for capturing, curating, preserving, and providing access for the long term 
to derived data that is cited in electronic publications. We are prototyping such a process and 
system. Our goals include assessing the scientific impact of this new approach to astronomical 
data, as well as working out sustainable business models for increasing the value of data in 
this way. Our prototype phase focuses on astronomy because of the technological maturity of 
electronic publications and data management in this discipline, and because of the wide access 
to digital data archives, and the unique, established relationship between the astronomers and 
libraries at Johns Hopkins University. 






Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by Paul Constantine, Associate Dean for Research and Instructional Services, 
University of Washington Libraries 

Within the context of New Partnerships, here is what I see as the key issues facing us.

Long-term Preservation 

Appraisal 

o There is simply too much data to preserve everything over the long term; we need to devise
ways to determine exactly what data is worthy of long-term preservation.

o Can we develop an understanding of how data might be re-purposed in ways totally
unpredicted by its original creator/gatherer/researcher?

o What legal/ethical constraints encourage or discourage the long-term preservation of
particular datasets?

Management 

Who is best positioned to manage long-term preservation?

o Scientists are often ready to move on to their next project and don't necessarily see the
value/need or have the resources to preserve their data over time.

o Data sitting on a scholar's computer are not easily accessible to other researchers wishing to
use and/or repurpose them.

o Campus Computer Centers are not always funded or equipped to preserve data for the
long term.

o Neither researchers nor computer centers are especially used to creating metadata schema
and assigning metadata in ways to make datasets more easily discoverable.

o Libraries, while skilled at creating metadata schema and assigning metadata in ways to
make datasets more easily discoverable, are generally not funded or equipped to preserve
digital data for the long term.

Curation of Scientific Data in Digital Form 

I see curation in many ways as the overall "thing." Long-term preservation and management
are components of data curation.

Funding 

o Many libraries and computer centers are not funded to provide data curation.

Participation

o Some researchers need to be convinced of the importance of data curation and the long-term
preservation of their datasets.

o Convincing researchers to share their datasets.
Intellectual Property Issues

o Have the data been copyrighted or their use patented?
o Are the data licensable? 
o Who owns the data? 
o What restrictions have they imposed? 

Preservation and Access 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Academic Libraries in the Digital Data Universe: The Reference Desk and Technical Memoranda 

Comments by Peter Cornillon, Professor of Oceanography, Graduate School of Oceanography, 
University of Rhode Island 

There are many issues that relate to the role of libraries in a digital age. The two that I have been 
concerned about for a long time are: 

The Reference Desk. Finding data of interest and then accessing these data can be extremely 
difficult. It is often difficult to know just where to start, and there appears to be a widening
gap between the expertise of the reference librarian in this regard and the state of the technology. For
example, there exist today a number of high level directories that will help one find data sets 
of potential interest — the Global Change Master Directory (GCMD) for the Earth sciences, the 
National Space Science Data Center (NSSDC) Master Catalog for the space sciences, etc. — but 
few reference librarians are aware of these or how to use them. In fact, the expertise in data 
discovery has moved away from the library to the researcher. This is both a problem and an 
opportunity. It's a problem for the new researcher or student who is looking for data; they 
have to go through the same, often painful, discovery process as all of their colleagues. It's an 
opportunity in that expertise exists at many institutions to help train new librarians in these 
areas and to retrain librarians already on the job. Unfortunately, this is an opportunity that has
not been exploited. NSF might investigate the funding of courses at library schools that draw on
the data discovery talents of the local research community. Such a course could be offered both as
a recertification opportunity for reference librarians and as basic instruction for students in
library science programs. The course could also address data access methodologies.

In addition to the research community benefitting from more expertise in the library with regard 
to data discovery, there would also be a direct benefit to those developing data discovery and 
access methods from more input from the library community. Bottom line: there are a number 
of benefits that would derive from a tighter coupling of the research and library communities as
it relates to data discovery and access.

Technical Memoranda — Gray Literature. Although universities have taken the lead in the 
development of end-to-end data systems in a highly distributed environment, there is one area 
in which they have taken a significant step backward. In the past, researchers often "published"
their data in paper form as technical memoranda or some equivalent, and these reports were (and
still are) archived in the university's library. With the advent of the Web, such technical reports
have all but disappeared, with researchers "publishing" their data on personal Web sites; i.e.,
the institutional commitment to a long-term archive of the data has all but disappeared. This
is a trend that must be reversed or many of these data will be lost forever - universities must
provide a mechanism for researchers to "publish" their data electronically for permanent 
archival in the university library. 
Acronyms

GCMD Global Change Master Directory 
GSO Graduate School of Oceanography 
NSF National Science Foundation 
NSSDC National Space Science Data Center 
URI University of Rhode Island 

In this position paper, my references to finding and accessing data refer to finding and 
accessing digital data on the Web; i.e., finding and accessing remote repositories of data. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form

Comments by Bernard Dumouchel, Director General, Canada Institute for Scientific and 
Technical Information (CISTI), and Richard Akerman, Technology Architect-IT, Architecture, 
CISTI 

We feel that there is a role for CISTI in promoting the use and stewardship of Canadian research 
data, in order to maintain the position of Canadian science. As well, we foresee a potential role 
in coordinating open data activities, similar to the UK Digital Curation Centre (DCC) model. We 
would like to explore new partnerships in this area. We feel that there are many possibilities for 
enhanced use of and access to data, such as wider and richer linking of data to publications. 

The Internet is enabling much greater openness in several areas: 

Open access is partly about publication funding models (which will not be discussed herein), 
but also importantly about providing free, public access to information. 

Open data is generally less contentious than open access, as the funding model aspect is less
important. There is general agreement even amongst publishers that scientific data should be 
open to all (with appropriate privacy and other constraints). 

Open discourse is about broadening the scientific discussion beyond the confines of traditional 
venues. Without the constraints of a printed page or a conference session, rich discussion is 
possible both amongst scientists and between scientists and the general public. 

Libraries are well positioned for these trends, which could be considered part of the 
development of Open Science. Research libraries can play a role in the promotion and 
understanding of all of these areas, as well as potentially providing infrastructure or 
coordination. 

In terms of the infrastructure aspects, to some extent these are already being addressed by 
existing e-Science or cyberinfrastructure programs, although they have a focus more on 
computing and storage resources for researchers. 

The Canada Institute for Scientific and Technical Information (CISTI) has a long history of 
participation in the realms of scientific computation and scientific data. In particular, we have 
a longstanding role in cataloguing data and promoting its use. CISTI hosts the Canadian 
secretariat of CODATA, participates in ICSTI, and was involved with the production of a report 
on Canadian access to scientific research data (the NCASRD report). We provide a Depository of 
Unpublished Data and our Research Press journals support the concept of supplementary data. 

There are a number of issues that need to be discussed: Can the concept of a trusted digital
repository be extended to data repositories? What additional management elements and criteria
will be needed? How can we deal with existing scientific data, which may be well managed
but not necessarily compliant with any repository or access standards? How shall data be
catalogued and identified, particularly across scientific fields? How should data sets be cited,
and how can versions be handled as data sets grow and are updated?

We look forward to discussing these and other issues as we explore the digital data universe 
together. 

References 

Digital Curation Centre
http://www.dcc.ac.uk/

Canadian National Committee for CODATA
http://www.codata.org/Canada/tofr.shtml

Report on Data Activities in Canada
http://dac.cisti.nrc.ca/datact_e.cfm

Canadian National Consultation on Access to Scientific Research Data (NCASRD)
http://ncasrd-cnadrs.scitech.gc.ca/

Depository of Unpublished Data
http://cisti-icist.nrc-cnrc.gc.ca/irm/unpub_e.html

Research Press - Supplementary Data
http://pubs.nrc-cnrc.gc.ca/rp/rptemp/rp2_news2_e.html

Networks: recipe for success in the knowledge age
http://cisti-icist.nrc-cnrc.gc.ca/media/news/cn20n3_e.html#aO

open discourse + access + data equals open science?
http://scilib.typepad.com/science_library_pad/2006/09/open_discourse_.html
Managing Collections of Highly Dispersed, Heterogeneous Data

Comments by S. Hampton, M. Jones, and M. Schildhauer, National Center for Ecological 
Analysis and Synthesis (NCEAS), University of California, Santa Barbara 

Libraries have historically played a critical role in the long-term preservation of scholarly 
works, especially books and other artifacts written in natural language. Curation of scientific 
data has challenged the capabilities of existing systems because of the unique characteristics 
of scientific data and the special services which must accompany archiving of scientific data. 
Unlike books and similar publications, scientific data are generally intended for analysis, 
modeling, and visualization rather than reading and browsing. Analysis and modeling 
activities often require the quantitative integration of multiple data sets from dispersed 
locations that vary tremendously in their structure and semantics, which creates a need for 
much more detailed metadata than is available in traditional library usage. Depending on the 
discipline, scientific data can be small and complex, requiring substantial documentation to 
accurately interpret the data, or large but uniformly structured, requiring less documentation 
but presenting system scalability issues. These and other fundamental differences in the way 
one uses scientific data lead to the need for new partnerships that can effectively provide for 
simultaneous preservation, discovery, access, integration, and analysis of data. 

Heterogeneity and Dispersion. Dealing with heterogeneity and dispersion in the context of 
integration, analysis, and modeling is the major key to successfully building data collections. 
Disciplines that are strongly bounded along lines of similar information (such as genetics and 
proteomics) can have highly integrated solutions like GenBank or the Protein Data Bank, yet 
other fields (like Ecology) can vary widely in the types of information that are necessary within 
the context of a single study (e.g., a single ecological study may require data from population 
biology, genetics, hydrology, and meteorology). Consequently, one archival solution might 
not fit all disciplines, unless that solution provides interfaces that enable both breadth of 
coverage and depth of resolution within any given discipline. For example, traditional metadata 
systems used in libraries provide metadata that assists with discovery at a coarse-grained 
level, but understanding heterogeneous data requires detailed metadata that describes the 
structure, content, and semantics of data and the protocols used to generate the data. Although 
metadata standards overlap tremendously, discipline-specific extensions must be created to 
fully understand and utilize data. Data dispersion can play an important role in structuring 
collections. Local institutions are typically the best curators of scientific data because they best 
understand the data collection and quality assurance processes. Although specific versions of 
scientific data sets are static and can be preserved as-is, data users find errors and omissions that 
are fixed in subsequent versions, requiring an active curatorial system that dynamically links 
actions of local scientists to regional, national, and global archives. For libraries to provide an 
effective archival system for science data, they must build semantically rich data infrastructure
that allows direct access to heterogeneous data from within the analysis and modeling
systems used by scientists and that allows for curatorial linkages among data systems at
local, regional, national, and global scales.

Long-term Preservation. Many disciplines, including ecology, lack a mechanism for assuring 
the long-term preservation of scientific data. Although some nationally scoped data archives 
exist (e.g., the NASA DAACs, the NODC, etc.), many of these are federally funded and are
subject to the vagaries and cycles associated with public federal funding. These archive centers 
tend to focus on archiving data without fully dealing with the difficult and expensive aspects 
of long-term curation, including the creation and maintenance of new data versions, media 
migration, and software obsolescence. A partnership of libraries and data centers that each
contain replicas of scientific data linked to local mechanisms for data curation and update
would be far more durable over the long term than single, centralized data archive systems.

Data Sharing. Despite widespread agreement that sharing data is paramount to the scientific 
method and essential to synthetic advances that span scales and disciplines, institutional and 
individual sociological barriers to intellectual rights of data use remain a serious problem. 
Diverse approaches to preserving data that gradually migrate disciplines into more open 
and unquestioned sharing of data will benefit science but require new partnerships among 
scientists, data centers, libraries, scholarly societies, and universities. One approach is to 
provide incentives to data sharing directly to scientists, e.g., as the Kepler scientific workflow 
system has done by directly linking analysis and modeling capabilities to data archives and 
sensor networks. New partnerships that promote generic data access interfaces allow us to build 
integrated systems that scientists can use to access data archives during the course of analysis 
and modeling, thereby providing an incentive to share data. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form

Sustainable Economic Models 

Comments by Margaret Hedstrom, Associate Professor, School of Information, University of 
Michigan 

The Intensive Care Unit Analogy. As a society, we are alarmed at the rising cost of health 
care. Think of what health care would cost if all of the patients were in the Intensive Care Unit 
or the Emergency Room. What is the analogy to long-term preservation? All of our patients 
(that is all of the data that needs to be diagnosed and treated for the disease of decay) today 
are in the ICU or the ER. [This is a bit polemical, but I hope you get my point.] An affordable 
(i.e., economically viable) system for long-term preservation requires preventative medicine,
a system of diagnosis of similar problems, treatment protocols and good practices, criteria for
triage, processes and tools that support healthy data, and an infrastructure oriented to health of
data rather than illness. Getting to this point will involve research (what are the good practices,
which types of diseases affect which types of data, how do we motivate data producers to be
"health conscious" about their data, etc.), especially in the absence of a known or quantifiable
future demand.

The Value Proposition. If we take as a given that not all data are created equal and that we 
will not be able to afford to keep everything, how do we decide where to invest in preserving 
data? This is fundamentally an information problem. How do we make effective economic
decisions in the face of uncertainty about the supply of data and the future demand? Are there 
any economic models or research issues that provide insights into comparable problems? What 
happens when the future value of a particular set of data is contingent upon its relationship 
to other data that have been preserved and can therefore be aggregated? At what level of 
granularity do we make selection (i.e., investment) decisions, given that deciding what to
preserve is a very labor-intensive and expensive process?

Public Goods with an Unknown Future Value. I assume that there are numerous similar cases 
of public goods with an unknown future value, but how can we learn from these and make a 
similar case for long-term preservation of [the right] digital data? I think we could also leverage 
present value, but we need some good examples. 

Disclaimer: I am not an economist, nor do I play one on TV.
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Comments by Charles Humphrey, Academic Director, Research Data Centre, and Head, Data 
Library, University of Alberta 

Preserving scientific data is a process best characterized by a life cycle model that differentiates 
the various stages through which research is conducted. The life cycle perspective helps 
visualize a global representation of this process and helps identify the digital objects produced 
throughout the full research cycle. Within each stage are outputs in digital format that record 
or summarize research activities. For example, literature reviews, research prospectuses and 
grant applications are typical products of the Study Concept and Design stage that later become 
important sources when documenting data. Some of these products are specific to a particular 
stage while others are passed between stages. For example, a data file from the Data Processing 
stage will be passed to a subsequent Analysis stage. The life cycle model helps monitor both the 
digital objects bound within a stage and those objects that flow across stages. 

This type of model also depicts the wide mix of implicit and explicit partnerships that occurs 
during research, including relationships among researchers, grant agencies, universities, data 
producers, scientific publishers, libraries, data repositories and others. New scientific research 
is stimulated by the intellectual capital, resources and infrastructure brought together through 
such partnerships and much of today's research is shaped by these interdependencies. The big 
picture from the life cycle model ensures that the combination of relationships within a project 
is recognized and well described. 

Given a life cycle perspective, what are the key issues of long-term preservation, data 
management and the curation of data in the context of new partnerships, new organizational 
models and sustainable economic models? 

New Partnerships. The preservation of scientific data is dependent on the custodial care of the 
digital objects produced throughout the research process. The traditional practice of gathering 
a paper trail of research outputs long after a scientific investigation has been concluded and 
depositing them with an archive is inapplicable in the digital era. Too much valuable research 
data are either at high risk of being lost or have been destroyed because of inappropriate 
practices that were carried over from a time when paper was the dominant medium. In the 
digital era, the challenge is to coordinate among partners the care of research data throughout 
the life cycle. Digital custodianship requires clearly articulated roles for the care of the digital 
objects, including which partner has primary responsibility in each stage of the life cycle. 

New possibilities exist for librarians to serve as partners in the life cycle of research. Today, 
data librarians are on staff in many academic research libraries where collections of data 
are made available through library data services. For the most part, data librarians are not 
engaged in primary research being conducted on their local campuses. Instead, they mainly
support researchers undertaking secondary data analysis. While this is an important service,
the potential for data librarians to be much more involved in activities across the life cycle of 
research remains untapped. For example, data librarians could contribute significantly to the 
management of metadata throughout a research project. 

Research data require high quality metadata for proper preservation and to be of value for
re-use. New metadata standards are emerging that facilitate the discovery and repurposing of
data. The enhancement of such standards requires participation by all of the partners in the life 
cycle of research. With the production of comprehensive metadata based on open standards, 
new partnerships will be needed to develop open-source tools for mining this rich metadata. 
Research libraries should provide access both to the body of scientific literature and to the data 
upon which this literature is based. The publishers of scientific literature and the providers 
of library data services need to agree upon standard metadata elements that will facilitate the 
dynamic linking of data with scientific literature. With such metadata in place, new partnerships 
can be forged to develop the tools for integrating data with literature. 

New Organizational Models. Short-term access to data is often best facilitated by keeping 
the data in close proximity to its origins. However, long-term access is completely dependent 
upon thorough preservation practices and standards. One challenge we face is to establish 
a network of organizations with varying levels of responsibility to span the life cycle of 
research. For example, a local digital repository may take initial responsibility for providing 
access to research data, but long-term preservation and access become the responsibility
of a topical (discipline-specific) or general national repository. Coordinating the division of 
responsibilities across multiple digital repositories is a major organizational task. A model 
based on a federation of data repositories is one approach that would address the need for 
strong organizational coordination. 

New data repositories of national prominence need to be launched that take on the long-term 
responsibilities of preserving data and that work closely in coordination with local repositories 
responsible for short-term access to data. An open consultation is needed to determine how 
many of these repositories are required and whether their focus should be general or topical. 

The emergence of local digital repositories with recognized responsibilities for the care of 
research products requires a certification process to ensure best practices and to build a level of 
trust between researchers and the providers of repository services. The work by the joint digital 
repository certification task force between the Research Libraries Group and the U.S. National 
Archives and Records Administration has provided a framework for such a system. One or 
more organizational homes will be needed, however, to implement a certification process. 

Sustainable Economic Models. One of today's most serious threats to science is the 
commodification of research data, which includes the acts of selling research data at a cost 
in excess of the Bromley guideline and of inappropriately hoarding data under the pretext 
of intellectual ownership. Science flourishes in an environment of openness where ideas are 
exchanged, challenged and tested. This principle of openness also applies to the data upon
which research findings are based. The replication of research depends on an open exchange 
of data. The challenge we face as a scientific community is to find ways of preserving and 
exchanging research data that are not based on the commodification of research data. If we 
accept the premise that scientific research data must be a public good, how will the services to 
preserve and provide access to the data be financed? 

1 

A position statement that I wrote for the ARL E-Science Task Force presents an example of a 
research life cycle model. See "e-Science and the Life Cycle of Research" (2006) available online 
at http://datalib.library.ualberta.ca/~humphrey/lifecycle-science060308.doc.

2 

Life cycle does not refer to life span, which is the time from birth to death. Rather, life cycle 
is used to describe the processes within an environment under which resources are formed, 
transformed and re-used. For a brief summary of life cycle models and references to examples 
in addition to the one previously cited, see the item by Ann Green, "Conceptualizing the Digital 
Life Cycle" on the IASSIST Communique at http://iassistblog.org/?p=26.

3 

Original data collection is a defining aspect of primary research. 

4 

The Data Documentation Initiative (DDI) is an example of such a metadata standard for 
survey, aggregate and time series data. Version 3 of DDI introduces a metadata model based on 
the life cycle of data. For further information, see http://www.icpsr.umich.edu/DDI/.

5 

For a recent, comprehensive discussion of the issues of digital repositories and their 
applications with research data, see Ann Green and Myron Gutmann, "Building Partnerships
Among Social Science Researchers, Institution-based Repositories and Domain Specific Data
Archives," 2006. Pre-print deposited in: http://deepblue.lib.umich.edu/.

6 

An example of a federation of repositories in the social sciences is the Data Preservation 
Alliance for the Social Sciences (Data-PASS), which is supported by the Library of Congress 
National Digital Information Infrastructure and Preservation Program. For more information, 
see http://www.icpsr.umich.edu/DATAPASS/.

7 

The argument for new national data archives of prominence has been made by James Jacobs
and myself in "Preserving Research Data," Communications of the ACM, Vol. 47 (9), pp. 27-29.

8

For further information about the RLG-NARA digital repository certification task force, see
http://www.rlg.org/en/page.php?Page_ID=20769.

9 

The concept of pricing data at the marginal cost of reproducing a copy of the data is one of 
the Bromley Principles, which was published by the Committee on Earth and Environmental 
Sciences, National Science Foundation in "Data Management for global change research policy 
statements July 1991" U.S. Global Change Data and Information Management Program Plan, 
Washington, DC. 1992, pp. 42-48. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form

Comments by John Leslie King, Vice Provost for Academic Information, and Professor, School of 
Information, University of Michigan 

Access and Agency. The digital data universe brings significant changes in fungibility and 
interactivity in the origination, movement and use of data, information, and presumably, 
knowledge. The concept of "source" must be broadened to include not only human-mediated 
works of the kind we are used to, but direct sensory data from machine sensor networks and 
machine-constructed works at all levels from simple tabulation to syntheses that can take many 
forms (text, visuals, audio). These can be moved around at virtually zero marginal cost, and 
mashed-up into new creations by machines or humans for uses not even foreseeable at this time. 
Unlike the current "broadcast" model of most data/information transfer, in which a producer
"sends" content to a consumer, the new model will embrace a growing population of
producer/consumers whose roles are more difficult to distinguish. And, to top it off, if historical trends
hold, this access and agency will eventually extend to the population of humans with access to 
telephony — at present about 2.8 billion people and growing rapidly (figure half the population 
of the Earth by 2031, a mere 25 years away). 

Productivity Implications. Traditional labor productivity is simply the output produced as a 
function of labor input. Productivity is not very well understood in the realm of knowledge 
work, but everyone seems to think knowledge work is the important work for the 21st century.
Assuming the traditional case - that dramatic changes in factor costs (e.g., the cost of moving 
information around) alter factor ratios (e.g., the amount of information that one person can 
provide to the population) - we can expect astonishing improvements in knowledge work 
productivity. We have been stuck trying to answer the question of what value academic libraries 
really provide for the academy and for the society at large. Other than the usual shibboleths 
about public goods and conservation of human knowledge, our stories are pretty anemic. 

This problem might soon change to one of explaining why we are not moving more quickly to 
provide the enormous benefits available to the world in the digital data universe. Time to think 
outside the stale old box. 

Who's to Say? We've gotten used to knowing what's "real" and what's not in the realm of 
information and knowledge because we've built a huge credentialing infrastructure to answer 
such questions. We are moving to a world where producers and consumers are increasingly 
indistinguishable from one another, and the traditional production pathways can be ignored, 
along with their credentialing mechanisms. It will become more difficult to challenge the 
veracity and reliability of particular "entries" in the digital data universe, but more troubling, 
it will be increasingly difficult for anyone to claim and hold the authority to decide the answers 
to such questions. This should be of particular concern to academic libraries, which are residual 
claimants on such authority in many societies. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Comments by Rick Luce, Vice Provost and Director of the University Libraries, Emory 
University 

Many challenges confront the preservation of and access to digital scholarly information, as
well as the development of new capabilities which extend beyond the analog print paradigm.
At the heart of these challenges lie infrastructure issues that surround the requirements for
interoperability.

Data Model Interoperability. Moving beyond simple Web portal access and the harvesting of
static repositories raises issues related to defining (and implementing) the data model to use
for digital objects, which must be commonly represented across heterogeneous, non-static
repositories. The data model is important as a higher level of abstraction that must persist over 
time and support the following: 

• Abstraction for digital objects, required so digital objects can be seen as an instance of the 
class defined by the data model, and provision for a level of abstraction which persists over 
time regardless of evolution of changing technologies and formats; 

• Definitions for roles and quality assurance pertaining to the creation and maintenance of 
metadata, both human- and machine-generated;

• Quality assurance pertaining to the curatorial role of automated datasets; 

• Rights, from confidentiality to DRM; and 

• Sustainability (economic, social, organizational, technological). 

Repository Interoperability. Enabling new value chains initiated in repositories that are: 

• Cross-repository interoperable and federated. Note: repositories may be organized by domain,
discipline-orientation, institution or organization, type (e.g., dataset, learning object, format),
etc.; however, they should not be considered static nodes in a communication system
merely tasked with archiving digital objects and making them accessible through discovery
interfaces. Rather, these repositories are part of a loose, global federation of repositories,
and scholarly communication itself is regarded to be a global workflow (or value chain) 
across dynamic repositories; 

• Support a set of core services utilized via both machine and human user interfaces; and 

• Facilitate emergence of richer cross-repository services. 

Ecological Interoperability. This enables:

• Persistent communication infrastructure, independent of changing technology, which 
records and expresses the origin and authority of the unit of scholarly communication; 

• Global and automatically executed workflows and grid-enabled workflows which support 
use and reuse across scholarly repositories;

• Data provenance in a heterogeneous networked environment; 

• Distributed interoperable instruments and sensor-based registries; 

• Information filtering which automatically pushes information to the user(s); and 

• Emergent forms of social software, collaboration environments and network-based user
profiles and network traversal log activity analysis. 

Increasingly, value resides in the relationships between papers, their associations, and the 
supporting data sets and materials. To manage and utilize the potentially rich and complex 
nodes and connections in a large knowledge system such as the distributed Web, system-aided 
reasoning methods would be useful to suggest relevant knowledge intelligently to the user. 

As our systems grow more sophisticated, we will see applications that support not just links 
between authors and papers but relationships between users and information repositories, and 
users and communities. What is required is a mechanism to enable communication between 
these relationships that leads to information exchange, adaptation and recombination. A 
new generation of information-retrieval tools and applications is being designed that will
support self-organizing knowledge on distributed networks driven by human interaction to 
support trans-disciplinary science. Through the use of these new tools, we will derive a shared 
knowledge structure that is based on users and usage in addition to that provided by author 
citations. Thus, the aggregated connections that readers make between papers and concepts 
will provide an alternative conceptualization of a given knowledge space. Such techniques will 
be coupled with classical search and retrieval methods, and these capabilities have an obvious 
utility for discovering and supporting evolving knowledge from these networks. The same 
concepts can be applied to data sets and rich media sources. 

This emerging adaptive Web will analyse and use the collective behaviour of communities of 
users, utilizing concepts such as adaptive linking, which facilitates the evolution of knowledge 
structures based on collective user behaviour over time, and spreading activation, which 
uses a memory-recall process model from cognitive psychology. For example, using known 
keywords to search across distributed open archives, a user would receive recommendations of 
other conceptually related keywords, relevant articles, data sets and so on, based on semantic 
proximities linked across a multitude of distributed information resources. At the same time, the 
knowledge system the user has interacted with can begin to reorganize itself by incorporating 
feedback from the interaction into its knowledge structure. From the user perspective, such 
systems can use adaptive webs as a communication fabric to manage and co-evolve the 
knowledge traded with communities of members and users. Correspondingly, these new tools 
and systems will influence the adaptation of the structure and semantics of scientific discourse. 
Many questions remain unresolved, such as how we evaluate the knowledge structures and 
representations of such size and complexity. 
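
The spreading-activation idea described above can be illustrated with a minimal sketch. It is not drawn from any system discussed at the workshop: the graph, the node names (such as "kw:ecology"), the link weights, and the decay and threshold parameters are all hypothetical, chosen only to show how activation starting at a user's known keywords could flow along weighted links to surface related articles and data sets.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, threshold=0.05, max_steps=3):
    """Propagate activation outward from seed nodes along weighted links.

    graph: dict mapping node -> {neighbor: link weight in (0, 1]} (hypothetical data).
    seeds: nodes the user started from (e.g., known keywords).
    Returns a dict of node -> accumulated activation.
    """
    activation = defaultdict(float)
    for node in seeds:
        activation[node] = 1.0
    frontier = {node: 1.0 for node in seeds}

    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay  # energy fades with each hop
                if passed >= threshold:           # ignore negligible activation
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


def recommend(graph, seeds, top_n=3, **kwargs):
    """Rank non-seed nodes by activation, highest first (ties broken by name)."""
    scores = spread_activation(graph, seeds, **kwargs)
    ranked = sorted(
        ((node, score) for node, score in scores.items() if node not in seeds),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [node for node, _ in ranked[:top_n]]


# Hypothetical links between keywords, papers, and data sets.
GRAPH = {
    "kw:ecology": {"paper:A": 0.9, "kw:biodiversity": 0.6},
    "paper:A": {"dataset:X": 0.8},
    "kw:biodiversity": {"paper:B": 0.5},
}

print(recommend(GRAPH, ["kw:ecology"]))
```

In an adaptive system of the kind envisioned, the link weights themselves would presumably be adjusted over time from collective usage; this sketch treats them as fixed.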
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

Basic theme: Organizational Models 

Comments by Barbara Lust, Professor, Human Development, and Janet McCue, Associate 
University Librarian for Life Sciences, Cornell University 

Engaging the Research Community. Research scientists now face the necessity of a major shift 
in "Zeitgeist" of how they must think about their data, and their labs in general, if they are to 
take advantage of the power and promise of the cyber-infrastructure-based digital environment 
in which we now exist. If research data sets created by individual research labs are to become 
part of a national or international digital data framework which is "open, extensible, and 
evolvable," they can no longer function in isolation. For example, they must adopt revised 
methods of data management which ensure data preservation and data sharing. Unless the
research community can be brought to understand the significance and the usefulness of such 
changes, they will not adapt. 

Partnerships Between Research Labs and University Library and Across Libraries. 

Establishing partnerships between research labs and the university library provides a possible 
new infrastructure which is essential to the overall goal (a new "system of science" in the 
digital framework). This partnership fosters an environment where library and research lab 
do not function in isolation from each other. However, this infrastructure requires: (i) personal 
investment from both research lab and library personnel; (ii) interpersonal collaboration at 
an infrastructure level; (iii) university level support; (iv) new middleware for collaboration 
and coordination; (v) reciprocal adaptation by both the research lab and the library to the 
information structure of the particular data and materials involved. These needs arise at a 
repeated but higher-order level when interlibrary exchange is developed.

Establishing Knowledge Networks and Related Ontologies Which Bridge Numerous 
Research Centers. New intermediary infrastructures can potentially help bridge the divides 
that now exist between individual research labs, between institutions housing these labs, and 
between lab and university libraries. One such structure is the "Virtual Center" in a knowledge 
area. 

Basic theme: Sustainable Economic Models 

Barbara Lust (Professor, Human Development) and I (Associate University Librarian for Life 
Sciences) are co-PIs on a project at Cornell that relates to research data and library/laboratory
collaboration. The purpose of our NSF Small Grant for Exploratory Research is to test the 
feasibility of extending the role of large research libraries in supporting value-added services 
for research data, including access, metadata, outreach, training, and archiving. The exploratory 
grant was focused on one laboratory (Language Acquisitions Lab); a supplemental award 
targeted a second research group (Agricultural Ecology Program). In the supplement, we are 
evaluating whether the conceptual model developed for language acquisition data is applicable
to other disciplines. For an overview of the project's goals and accomplishments, see the project
Web site, http://metadata.mannlib.cornell.edu/lilac/

Based on our experience in this planning grant, there are many significant issues to address 
in considering sustainable economic models, including capacity building, scaling-up, and 
determining future costs. 

Staffing. Although the TeraGrid is a reality, it will only reach its full potential when it is heavily
traveled by a broader spectrum of the research community. It is a significant challenge to build 
the human capacity to deliver data and associated services in ways that support the research 
community. We will need skilled programmers to develop the tools for data-driven research 
and facilitate discovery and access; agile librarians with strong academic backgrounds to curate 
the collections and support end users; and sympathetic researchers who understand the value 
of good metadata, best practices, and archival decisions. For example, in our work with Lust's 
lab, both the metadata librarian and the programmer have linguistics backgrounds; in our AEP 
grant, our Research Data/Environmental Sciences Librarian, who has a graduate degree in
Ecology & Evolutionary Biology, works closely with the 12 co-PIs in the project. Having these 
specialized backgrounds allows the library to more easily translate the needs of the research lab 
into services and to understand the curatorial and preservation aspects of the data.

Scaling-up. We are working with two small projects in two labs within a single university, and 
some close collaborators. How do we scale up to deal with oceans of data in diverse disciplines
bridging multiple institutions? Can we leverage what we learn with one project and apply it to 
another? Can we mainstream some activities so that specialized staff consult and support staff 
process? Can we do a better job of capturing data at the point of creation, in formats that can be 
made accessible, re-purposed, archived, and mined? 

Estimating Future Costs. It is difficult to determine long-term costs and long-term 
commitments when the models are still evolving. How do we determine long-term costs when 
the issues related to long-term availability/preservation of data are still puzzling us? If we
develop collaborative repositories, how invested will the individual institutions and individual
researchers be in sustaining a cross-institutional entity? Can the costs be generalized for other
institutions and other disciplines? Who is likely to bear the costs associated with research data 
discovery and preservation — research institutions? Granting agencies? Will STM vendors or 
universities or new entities offer subscriptions to institutions for services related to research 
data and will institutions/researchers be willing to pay for those services?
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

A Statement for the Management of Massive Datasets: New Partnerships 
Comments by James L. Mullins, Dean of Libraries, Purdue University 

In the future, a researcher will read an article multi-dimensionally. Not only will there be text, 
with hyperlinks to related literature or citations within the article, there will be links to the data 
reported within the article, through graphs, tables, illustrations, that will link to related datasets. 

These datasets will be organized and retrieved based upon accepted, documented and well
understood taxonomies and ontologies within the discipline; crosswalks will link automatically
between one taxonomy and another to lead a researcher from one field of enquiry to another,
thereby making connections from the findings and results in one field to another.

Disciplinary scientists, computer scientists, computer technologists, and librarians as data
scientists within universities, professional societies and other research entities will work as teams
to collaborate, from creation of the research project to the dissemination of the findings to the 
curation of the data for the present and for the future. 

Creating a "community proxy" will be the work of the disciplines in collaboration with 
librarians to determine the logical description and structure in which data would be organized 
and accessed. Massive data set repositories will be distributed around the world, in locations 
adjacent to related research centers, providing access to the international research community. 
Universities, research laboratories and governmental units will share in this undertaking, 
each picking off a "piece" of the massive undertaking, as exemplified by the recent agreement 
between the San Diego Supercomputer Center (SDSC) and the National Archives and Records
Administration (NARA). 

This brave new world will be time consuming, challenging and expensive to create. It will be 
imperative that research-funding organizations such as the National Science Foundation step up 
to help facilitate and cause this vision to materialize. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific
Data in Digital Form 

The Coming Metadata Deluge 

Comments by James D. Myers, Associate Director, Cyberenvironments, National Center for 
Supercomputing Applications, University of Illinois at Urbana-Champaign 

With the rise of computing, our ability to produce content - from raw data to summarizing 
documents - has exploded. People speak of "information overload" and "data deluge" to 
describe the problems caused by this explosion for those trying to find, analyze, comprehend, 
and store the growing body of material available. To date, the automation of processes to 
capture the history and context of data has not kept pace, making the development of systems 
for data preservation, curation, and discovery extremely labor intensive. However, it is not clear 
that this situation is permanent, and, in fact, there are many reasons to think that there will 
soon be a 'metadata deluge' as our ability to capture and share metadata catches up with our 
capability to produce data. If such a metadata deluge occurs, it would profoundly affect the role 
of libraries and the design of curation and preservation infrastructure. An analogous change 
occurred with the data deluge - as our ability to create, store, and share content increased, our 
ability to organize information became a bottleneck, and infrastructure such as the World Wide 
Web arose. The Web, which directly supported the ability for experts and non-experts alike to 
organize information, created a market for third-parties to re-organize existing material and 
for 'competing' entities to offer alternate organizations. The Web also enabled those without 
the means or expertise to maintain content to develop collections nonetheless. More recent 
innovations such as blogs, wikis, and community spaces (e.g., MySpace and virtual 3-D worlds 
such as Second Life) go even further in enabling content creation and organization without 
technical expertise or owned infrastructure. 

While this stack of Web technologies does not support the requirements for creating, managing, 
and curating collections or for long-term information preservation, there are emerging extensions 
in this area that have the potential to spark a transformation in these areas analogous to the 
Web transformation of information publishing and organization. For example, global identifier 
schemes such as Handles, Digital Object Identifiers, and Life Science Identifiers now 
provide Web gateway mechanisms to create persistent URLs. More recent schemes such as 
the Archival Resource Key (ARK) bring a more Web-centric view and additionally provide a 
means of decoupling the roles of the initial information provider and subsequent curator(s) 
in the way identifiers are generated and in how metadata is attributed. Extensions to HTTP 
such as WebDAV and URIQA provide means of managing versions and metadata in XML and 
RDF formats. Specifications such as the Java Content Repository API (JSR 170 and JSR 283) 
are standardizing the same types of functionality at the programming interface level. These 
technologies are making their way into large data grid and repository software, but they are 
also being used directly in scientific applications and environments to enable up-front capture 
of metadata and data provenance information. For example, the CombeChem project has 
developed an experiment planning and execution environment that captures the experimental 
design, the provenance of data in specific experiments, and electronic notes related to the 
experiments as a single web of RDF information that can subsequently be searched, viewed, and 
potentially harvested. The Collaboratory for Multiscale Chemical Science (CMCS), with which I 
have been involved, provides a similar service that lets applications record any and all information 
related to data and experimental procedures; it has focused more on connecting information 
across scientific disciplines and on the dynamic community assembly and evaluation of 
reference data and associated computational tools. Many other examples could be cited - from 
ones like these that are primarily driven by the goal of directly increasing researcher and 
community productivity to those that are more specifically focused on the long-term curation of 
data. 
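
The idea of recording experimental design, provenance, and notes as one linked web of statements can be illustrated with a small sketch. It is not the CombeChem or CMCS implementation: the class name, the predicate names (derivedFrom, recordedBy, annotatedBy), and the identifiers are all hypothetical, and a plain Python set of subject-predicate-object triples stands in for a real RDF store.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy store of (subject, predicate, object) statements, in the spirit of RDF."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        # Each statement links an identified item to a value or to another item.
        self._by_subject[subject].add((subject, predicate, obj))

    def describe(self, subject):
        # Everything recorded about one item, in a stable order.
        return sorted(self._by_subject.get(subject, ()))

    def lineage(self, subject, predicate="derivedFrom"):
        # Follow `predicate` links backwards to find every upstream source.
        found, stack = [], [subject]
        while stack:
            node = stack.pop()
            for _, pred, obj in sorted(self._by_subject.get(node, ())):
                if pred == predicate and obj not in found:
                    found.append(obj)
                    stack.append(obj)
        return found


# Hypothetical identifiers; a real system would use persistent ones (e.g., ARKs or DOIs).
graph = ProvenanceGraph()
graph.add("id:spectrum-42", "derivedFrom", "id:raw-run-7")
graph.add("id:raw-run-7", "derivedFrom", "id:sample-3")
graph.add("id:raw-run-7", "recordedBy", "instrument:ftir-1")
graph.add("id:spectrum-42", "annotatedBy", "curator:library-a")

print(graph.lineage("id:spectrum-42"))
```

Because every statement hangs off an identifier rather than a file location, a third party (a library, for instance) can add its own annotations later without touching the original records.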

Working from the Web analogy, emerging semantic and content management technologies, 
and the exploratory projects using them, one can anticipate a period of rapid change in the 
curation and preservation of digital data and in the role of libraries. Very rich information - 
with all the detail captured and/or used by all instruments and applications in scientific 
experiments - will be available via standard protocols in self-descriptive schema and directly 
available, given authorization, for inclusion in institutional repositories, community databases, 
reference collections maintained by scientific associations, etc. The information collected and 
any additional annotations generated by third parties will be transferable (thanks to unique 
identifiers) and the data/metadata could be migrated, cached, or replicated as needed by the 
organizations interested in it. Questions about what to collect may become much more graded - 
should the content be indexed only, should it be cached for performance, should it be copied 
to reduce risk or to extend the retention period, what metadata and ancillary data should be 
indexed, cached, copied along with the primary artifacts of interest? It is possible to imagine 
that different institutions may make very different choices in these areas to customize their 
solutions and provide added value for specific user bases, with or without global coordination 
or concepts such as master copies or tiered collections. 

In planning for the next-generation of digital data curation and preservation capabilities, it is 
important to question our assumptions. While the expertise gained over centuries in curation 
and preservation will be central to robust solutions, it will be necessary to disentangle principles 
of information management from practices that actually represent compromise based on the 
current limits of technologies and organizational structures. Conversely, while technological 
progress will play a driving role, complex socio-technical issues will be faced in defining 
practical solutions that align with cultural and economic realities and are 'just complex enough' 
to serve society's needs. If the Web analogy is broadly valid, we are about to enter a period of 
rapid progress, new ideas, and new partnerships that will dramatically change and improve our 
ability to understand the world's information. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by Frank Rack, formerly Director, Ocean Drilling Programs, Joint Oceanographic 
Institutions; currently, Executive Director, ANDRILL (Antarctic geological Drilling) Program 

My views on the topic of the role of academic libraries in the preservation of digital data 
collections are primarily derived from a geosciences perspective, based on experiences from 
participating in and directing scientific drilling programs. These programs, which have enjoyed 
a long history of sustained support from NSF and international partners, have produced a 
substantial volume of physical samples (e.g., over 330 kilometers of sediment and rock cores) 
and data (both analog and digital) that are organized into structured collections using relational 
database management systems as well as archives of unstructured data. These data holdings 
are expanding rapidly with the increasing acquisition of visual imagery and specialty data 
sets (e.g., volumetric imaging using X-ray CT and NMR/MRI) that are common to biomedical 
domains. Scientific drilling projects and programs cross NSF organizational boundaries 
spanning Earth, Ocean, and Polar Programs. These activities are now expanding to include 
the collection of observational time series across time and space with requirements for data 
streaming and remote user control of sensors and other resources. New models of sharing, 
storing, analyzing, and archiving data are required. Academic libraries have an important role 
to play in this new data universe. 

Academic Libraries as Collaboration Centers for Research, Education, and Public Outreach. 

Academic libraries are naturally the center of the campus knowledge management and 
information exchange process and can be enhanced with physical resource investments to 
become "amplified collaboration environments" serving as a focal point for cyberinfrastructure 
access on academic campuses. The past NSF investment in connecting academic campuses to 
the high-end research networks, such as Internet2 and National LambdaRail (NLR), can be enhanced 
with campus-wide dark fiber networks that support broadband connectivity to campus 
buildings and facilities, like libraries, that may be remote from the Internet2 or NLR node, but 
are centers of learning. These dark fiber networks are owned and operated by the academic 
IT infrastructure and provide capabilities to link researchers, educators, and students in a 
distributed collaborative environment dedicated to knowledge management and information 
sharing supported by data discovery, analysis, and visualization tools that can be accessed 
through the campus library system. The University of California at San Diego has already taken 
this step and is currently operating a dark fiber network on its campus. 

These nodes could provide connectivity to the ACCESS GRID for collaborations with virtual 
research groups across campus or at other institutions and provide links with the DATA GRID 
for accessing computing resources and visualizations that are either pushed to the site or pulled 
from the site for research, educational or outreach activities. Investing in library infrastructure 
(hardware and software plus skilled staff) creates synergies with both local and remote research 
groups and encourages partnerships among researchers, academic staff and students through 
opportunities for training and participation in focused demonstrations of the research outcomes 
using visualizations that can also be shared with the broader academic community and the 
general public. Academic libraries should be encouraged to establish collaborative relationships 
with local research groups and existing centers of excellence who provide digital content in 
exchange for data aggregation, public access, archiving and preservation services (either locally 
or remotely through networked data centers that provide tools and Web services that can be 
accessed easily by trained users), including publication of data. 

Academic Libraries as Partners in Preservation of Analog Reference Collections and Legacy Data. 

Academic libraries are uniquely positioned to play an important role in the 
preservation of analog reference collections and legacy data for projects that are aligned with 
the research and education mission of a particular academic institution or the mission of local 
centers of excellence on each campus. The network of academic libraries should coordinate 
with each other to collaborate on content preservation efforts that build on their strengths 
while minimizing duplication of effort on a national (or, better yet, an international) scale. 
Domain specific groups of researchers, working as community agents, could work with 
designated academic libraries to digitize analog collections for long-term preservation with 
appropriate metadata. This type of partnership would combine the traditional strengths of 
library professionals with the opportunities provided by collocated teams of researchers in 
a coordinated way to support a large-scale, interoperable, networked architecture for data 
discovery, information sharing and knowledge creation. The academic libraries would play a 
fundamental role in supporting research and educational goals within the context of a pervasive 
cyberinfrastructure that would require partnerships among federal, state, and local (academic 
institution and local community) entities for investments that leverage technology to provide access to 
knowledge resources and training and outreach to stakeholders at all levels. 

An example from scientific ocean drilling is the need to transform 40 years of analog 
"Proceedings" of the Deep Sea Drilling Project (DSDP) and Ocean Drilling Program (ODP) into 
electronic format through document scanning and OCR. These volumes weigh approximately 
2000 pounds as a set and contain both data and metadata that could serve a wide community 
of users if they were readily available across the network. Plans to undertake this scanning/digitization 
project have been made and pilot studies have begun to transform this pile of paper 
into digital content through a partnership between the Texas A&M University Digital Library 
and the ODP Science Services group located at Texas A&M. Similar collections of key reference 
materials should be identified by specific domain science communities and prioritized for 
digital access in partnership with academic libraries. 

The evolution of a distributed, networked partnership among academic libraries, technology 
centers and research groups would require strategic planning and phased investments 
that leverage existing programs and initiatives to create new opportunities and enhance 
synergies among all parties. The prior NSF investments in establishing point-source 
academic infrastructures (e.g., Internet2 and NLR connections/nodes) should be leveraged 
by establishing a process to encourage the construction of dark fiber campus-wide networks 
connecting libraries to research centers to support data sharing, open access, and preservation/archiving 
of data that can be provided to researchers, educators and communities of learners. 
The infrastructure investments should be combined with opportunities for the development 
of Web service architectures to support data discovery, analysis and visualization to create 
a transformational environment of innovation that would enhance knowledge creation and 
dissemination and stimulate learning. 

Observational data in the future will be streaming from thousands to millions of field sensors 
that will require scalable visualization resources to allow humans to readily understand and 
comprehend the significance of these data. Academic libraries have a unique opportunity 
to establish a strategic role for themselves as centers for data integration, analysis and 
visualization, combined with a strong education and outreach mission. In order to realize this 
dream, academic libraries will have to form innovative partnerships with a variety of research 
groups and broad-based communities of educators and technologists who can translate these 
challenges into coordinated action plans that capitalize on the opportunities promised by this 
transformation. Academic libraries will become next generation centers for learning, education, 
and public outreach, and will need to provide training to a broad range of users to articulate the 
significance of the new world view that accompanies this change in the data universe. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Some Key Issues for Consideration 

Comments by Mark Sandler, Director, Center for Library Initiatives, CIC, Champaign, IL 

Archiving. Archivists in academic settings often comment on the lack of interest that scholars 
show in the notes and working files of their colleagues. Likewise, in the social sciences as 
well as other disciplinary clusters, there seems to be more reward in gathering new data than 
building on top of the sources of others. Nonetheless, we know this to be mixed, and ICPSR and 
other data centers probably have a sophisticated understanding of the kinds of data that do 
and doesn't support new and important research. We should be tapping into that knowledge 
to establish guidelines for archiving datasets, both at the level of which datasets should or 
shouldn't be saved, as well as which should be saved as bit streams with minimal investment 
as opposed to those worthy of being refreshed, migrated forward and kept readily accessible to 
subsequent generations of researchers. 

Aggregation. Aggregating data is efficient in terms of storage and management, and efficient 
as well in terms of retrieval. Perhaps disciplinary data farms should be developed, or perhaps 
this needs to be approached by funding agency or by academic institutions or collectives of 
academic institutions and research centers. A key to successful aggregation of data will be the 
emergence of standards around defining variables and achieving a degree of consensus about 
data gathering techniques that will permit greater comparability across studies. I understand 
that some differences in survey design and research methodology represent advances in a 
discipline, but I also understand that far too often these shifts are less about "progress" than 
idiosyncratic deviations masquerading as a research advance. 

In the world of text archives, there is increasing emphasis on standards, keyboarding and 
scanning guidelines, DTDs, and substantive metadata, all of which further content integration 
and system interoperability. The underlying theory here is that data gathered for a particular 
study or purpose might be more valuable as part of a larger whole than if self-contained and 
self-referential. 

Transparency. However it's done, data (especially data gathered with the assistance of public 
funding) should be more broadly available for public scrutiny and further analysis. I think we 
all respect the right of a researcher to have the time required for careful analysis of data he or 
she has gathered before it is opened up to the world. On the other hand, closing down access 
to useful data for many years on the off chance that the researcher will someday be inclined to 
return to the dataset seems selfish and not in the best interest of advancing scholarship. As with 
so many other issues in the underlying social relations of academe, we need to become clearer 
as a community about the social responsibility of scholars to engage in dialogue with the larger 
society. 



Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by MacKenzie Smith, Associate Director for Technology, MIT Libraries 

There are two aspects of scientific and engineering data that relate to academic libraries: 

• Data as primary source material available for further research and experimentation, using 
particular datasets or groups of datasets; and 

• Data as part of enhanced publications that form the basis of modern, digital scholarly 
communication. 

Academic research libraries and archives are closely involved with both of these as a part of 
their mission and expertise. However, broadening the scope of libraries and archives to include 
digital scientific research data brings big challenges. There are unanswered questions about the: 

• Technical infrastructure, and who will develop and manage it; 

• Collection practices involving decisions about what data will be kept, when, in what form, 
with what tools, what description; 

• Digital preservation practices (of unknown difficulty and expense); and 

• Legal framework that is necessary to allow this to happen at all. 

Libraries and archives will probably not be the primary providers of the large-scale storage 
infrastructure required. Nor will they provide the specialized tools to work with the data 
(sometimes at the level of individual datasets). They will also not provide detailed information 
about the data (which falls to researchers, or specialists from their societies and publishers). Nor 
will they provide the legal framework to enable open science. However, to achieve economies of 
scale across all scientific research domains and not just create data silos within particular scientific 
sub-disciplines, there is value in library practices around: 

• Collection policies and practices (appraisal, selection, weeding, destruction, etc.); 

• Data clean-up, normalization, description, and submission to archives; and 

• Collaboration with researchers around scholarly communication practices of the disciplines 
(e.g., educating students about these practices, or helping researchers find appropriate 
archives or publications). 

It's unclear whether libraries will provide the technical solutions to long-term digital data 
preservation. It is certainly within the mission of research libraries and archives to preserve the 
scholarly record, but the technical challenges and costs involved are large, and libraries will 
need to invest seriously in this area if they wish to help find solutions. 

Finally, for "enhanced publications" that include scientific data as a useful part of networked 
documents, there are missing standards that academic libraries are well positioned to help 
define, including: 

• Ontologies (for complex publications that include data); 

• Identifiers for publication parts that work across disciplines; 

• Consistent description practices for enhanced publications and their parts; 

• Data structuring conventions; and 

• Interoperability protocols for searching and retrieving data. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by Eric F. Van de Velde, Director, Library Information Technology, California 
Institute of Technology 

Both large and small research projects produce data of historical value worthy of preservation. 
Large projects must incorporate data preservation as an essential part of their project. Small 
projects need institutional support to help them implement high-quality data-preservation 
policies. Funding agencies should create Centers of Excellence in Preservation and encourage 
peer review of data sets and associated services and software. Computer- and library-science 
curricula should include data preservation. 

The size and specialized nature of the data of large research projects (Caltech-MIT's LIGO, 
Human Genome Project, ...) requires that data preservation be considered as an essential 
component of the project. These projects have the responsibility to manage the data, to 
make it available with appropriate services, and to preserve the data and the associated 
software and documentation. Often, the desired outcome is a data set that continues to grow 
indefinitely, supplemented with data from newer, more accurate, observations and experiments. 
Terminations of projects like these are relatively rare events that should be handled on a case- 
by-case basis as part of the closing-down process, which should be supervised and audited by 
the funding agency. 

Small research projects need institutional help with the work required to comply with 
preservation mandates. Managing scientific data requires scientific know-how at an expert 
level, and the local research library cannot be expected to handle preservation of data of all 
disciplines. It might be feasible, however, for a library to specialize in the preservation of data 
in one or two disciplines. For the rest, the library is the agent between local researchers and 
specialized data archives. 

Funding agencies should fund (distributed) Centers of Excellence in Data Preservation, each 
specializing in a particular discipline. As part of the competitive funding process, interested 
institutions would develop collaborative organizational networks capable of implementing 
effective preservation of specific data. This approach allows for organic growth, proportional to 
the actual needs. This approach also builds on the strengths of existing institutions (universities, 
research laboratories, and their libraries). 

Funding agencies should also provide incentives to accomplish more than "just archiving." Data 
obtained at great effort and expense should be made available as widely as feasible together 
with supporting services and software. Peer review of data sets and associated services and 
software would make it easier to consider this kind of work in tenure and promotion. Under 
suitable conditions, for-profit organizations could provide services using publicly available 
data, ensuring the use of this data for society's benefit. 



Finally, we must ensure that the talent to preserve scientific data will be available. The preferred 
approach is to provide incentives for computer-science and library science departments to 
include suitable disciplines in their curricula. 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by Todd Vision, Associate Director of Informatics, The National Evolutionary 
Synthesis Center, University of North Carolina at Chapel Hill 

The Respective Roles of Scientific Publishers and Institutional Libraries. Currently, the 
responsibility for management of data accompanying scientific publications falls to publishers 
in the form of supplemental data collections. Academic journals have little incentive to invest 
in the establishment and maintenance of digital data repositories that can be used for anything 
beyond minimal documentation of published reports. If there were a uniform and reliable 
system for digital data management that was hosted by the researcher's home institution, this could 
potentially supplant the current system whereby unstructured supplemental data is deposited 
at publication. However, such a system would need to be available widely, and not just at elite 
institutions, in order to be a viable alternative for publishers. 

The Untapped Value of Raw Data to the Researcher. In many scientific fields, difficult-to-obtain 
and essentially irreproducible datasets (such as long-term field observations in ecology) 
are analyzed and reanalyzed by the same research group over a period of many years, resulting 
in multiple publications, each containing only such statistical summaries of the data as are 
necessary to support the claims in the publication. Such datasets may continue to grow and 
accrete value over time. Understandably, the researcher feels entitled to an indefinite term of 
exclusive use, and there may be no single moment at which he or she would be comfortable 
providing even limited access to the full dataset. How common is this situation, and how 
worthwhile is it to invest in a system of scientific data preservation that would exclude such 
unique and valuable data collections? Or is there a way to manage such data while protecting, 
or alleviating, the researcher's concerns about exclusivity? 

The Burden of Metadata Curation. Digital data is nearly useless without extensive and high- 
quality metadata, both for resource discovery and for interpretability of the data itself. The 
researcher knows the data, but doesn't necessarily have sufficient expertise in information 
science to provide quality metadata curation, while librarians have the opposite problem. What 
incentives can be provided to researchers to undertake the burden of careful metadata curation, 
how can this task be made more manageable to nonexperts, and what incentive can be provided 
to institutions for QC of metadata produced by researchers? 
Top Three Issues in the Long-Term Preservation, Management, and Curation of Scientific 
Data in Digital Form 

Comments by Tyler O. Walters, Associate Director for Technology and Resource Services, 
Georgia Institute of Technology, Library and Information Center 

Issues in Digital Data Curation Leading Us to the Need for New Partnerships 

The top three issues affecting the growth and future of digital data curation involve: 

1) Culture and policy frameworks; 2) Technology integration and data curation tools; and 3) 
Economic sustainability of our partnership and consortial models. 

The expense and enormity of the challenge to ensure the preservation, curation, and overall 
management of scientific digital data lends itself to a multi-institutional solution. However, in 
order to build effective partnerships, we must first begin with the single organization. Raising 
awareness of the need for, and benefits of, managing digital data will be paramount. Universities 
will need to create programs and incentives that embed digital data curation in university 
culture, becoming an integral part of research projects. The initial change driver can be policy. 
Granting agencies such as NSF and NIH (DOE could be another source for a data curation 
policy) are considering data access and management policies that may drive universities with 
agency-funded research projects in this direction. In response, universities will need to make a 
commitment to the use, re-use, and maintenance of digital data. However, pragmatics also will 
dictate that not all data can be sustained. This situation requires designing criteria for selecting 
which sets of digital data will and will not become a long-term university responsibility. 
Libraries can contribute to the selection process by adapting archival appraisal theory as well 
as other parameters to judge which research materials are worthy of long-term accessibility. 

This combination of policy setting, awareness raising, cultural engineering, and selection 
criteria building will become an essential component in the rise of digital data curation programs. 
Libraries can assist with developing and implementing this agenda and be equal partners with 
scientists, advanced technologists, and policy makers. New partnerships and services akin 
to GenBank, operated by the NIH/NLM's National Center for Biotechnology Information, 
may need to be established. They may take on a discipline-based alignment, such as several 
universities with deep interests in astrophysics forming a consortium to provide data curation 
services for their home institutions' related data sets. 

Much work remains in areas such as improving technology integration and building reliable 
data curation tools. Data does not reside in just one information system; therefore, integration 
between systems is critical. Universities have research data residing in many applications such 
as databases (commercial and open), digital asset management systems, content management 
systems, and repositories. Curating this data consequently may include moving it from one 
system to another, linking it between systems, and migrating it to a central system. The exact 
future architecture is undetermined and many universities and consortia may take divergent 
paths. There will be common format and metadata portability issues. Ongoing development 
projects such as the Global Digital Format Registry (GDFR) can play an essential role in 
matching formats (including database programs and protocols) and providing information on 
software that can read certain databases and data formats, and recommend migration paths. 
Data curators will need tools for data and metadata extraction, database emulation, data 
provenance tracking, and documentation of the origin, use, and re-use of data. Partnerships and 
consortia can play a key role. They can promote the further development of new data curation 
technologies as well as standards and technical protocols to ensure interoperability and data 
migration. They can be vital to maximizing present resources and strategies in generating these 
new technologies through their synergistic activities. 
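
To make the idea of a format registry that recommends migration paths more concrete, here is a minimal sketch. It does not reflect the GDFR design or any real registry interface: the format names, the conversion table, and the set of "preservation" formats are invented for illustration, and a breadth-first search is simply one reasonable way to find the shortest chain of conversions.

```python
from collections import deque

# Hypothetical registry data: each format maps to formats it can be converted to directly.
MIGRATIONS = {
    "dbase-iii": ["csv"],
    "wordperfect-5": ["rtf"],
    "rtf": ["odt", "pdf-a"],
    "csv": ["sqlite"],
}

# Hypothetical set of formats a curator has chosen as acceptable long-term targets.
PRESERVATION_FORMATS = {"pdf-a", "csv", "odt"}


def migration_path(source_format):
    """Return the shortest conversion chain ending in a preservation format, or None."""
    queue = deque([[source_format]])
    visited = {source_format}
    while queue:
        path = queue.popleft()
        if path[-1] in PRESERVATION_FORMATS:
            return path
        for target in MIGRATIONS.get(path[-1], []):
            if target not in visited:
                visited.add(target)
                queue.append(path + [target])
    return None  # No known route: the data set needs emulation or manual attention.


print(migration_path("wordperfect-5"))
print(migration_path("unknown-format"))
```

A shared registry matters here because the conversion table is exactly the kind of knowledge that no single institution can maintain alone; a consortium can pool it and keep it current.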

If the work described above progresses, then the economic sustainability of these types of 
partnerships and consortia will become mandatory. The bottom line: funding and revenue 
streams need to be established. A mixture of funding sources will best guarantee the success of 
these new entities and help them to not become too reliant on any one source of funds. Partner 
dues, seed and project monies from grant agencies and private foundations, revenues from a 
variety of service and consulting fees, and several other creatively produced sources of funding 
are examples of cooperative ways to sustain the new partnerships. Those interested and vested 
in digital data curation should explore deeply new and dynamic models of organizational and 
economic sustainability. This is an opportunity to reinvent ourselves for the better as we face 
inherently new challenges in managing complex research objects such as data sets. 
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Appendix F. Examples of Scientific 
Community Archives 



Examples of Scientific Community Archives 
Rick Luce, Emory University

A number of disciplines maintain archives with submissions from their communities.
These are hosted under a variety of arrangements, in some cases by publishers. Examples
include:

arXiv (1991-), Physics, mathematics, computer science; main administration site at Cornell 
University, multiple mirrors worldwide, manages access to over 230,000 papers, abstracts 
include links to citation analysis for the paper by SLAC Spires and Citebase. 
http://www.arXiv.org/.

Citeseer (1998-, aka ResearchIndex), developed at NEC Research Institute, NJ, USA,
caches openly accessible full-text research papers on computer science found on the Web
in Postscript and PDF formats for autonomous citation indexing,
http://citeseer.nj.nec.com/cs.

ebizSearch (2001-), administered by the eBusiness Research Center at Pennsylvania State 
University, based on Citeseer software, academic articles, working papers, white papers, 
consulting reports, magazine articles, published statistics, and facts,
http://gunther.smeal.psu.edu/.

K-theory Preprint Archives (papers from 1995), managed at Mathematics Department,
University of Illinois at Urbana-Champaign. http://www.math.uiuc.edu/K-theory/.



Topology Atlas preprint server (papers from 1995), was most active in 1996 and 1997 
and still accepts submissions but suggests using the Mathematics Archive (arXiv.org or 
its Front) for distributing and finding preprints, hosted at York University, North York, 
Ontario, http://at.yorku.ca/topology/preprint.htm.

Algebraic Number Theory Archives (papers from 1996, frozen since Jan. 2003), hosted at
the Mathematics Department, University of Illinois at Urbana-Champaign.
http://www.math.uiuc.edu/Algebraic-Number-Theory/.

Mathematical Physics Preprint Archive, mp_arc (papers from 1991), hosted by 
Mathematics Department, University of Texas at Austin,
http://rene.ma.utexas.edu/mp_arc/index.html.

Hopf Topology Archive (papers from 1997), hosted by the Department of Mathematics, 
Purdue University, http://hopf.math.purdue.edu/.

Preprints on Conservation Laws (papers from 1996), administered at Department of 
Mathematical Sciences, Norwegian University of Science and Technology (NTNU), 
Trondheim, http://www.math.ntnu.no/conservation/.

MCMC (Markov Chain Monte Carlo) methodology Preprint Service (papers from 1993),
administered at the Statistical Laboratory, University of Cambridge,
http://www.statslab.cam.ac.uk/~mcmc/index.html.

Jordan Theory Preprint Archives (papers from 1996), hosted at Institut für Mathematik,
Universität Innsbruck, http://mathematik.uibk.ac.at/mathematik/jordan/index.html.

Groups, Representations, and Cohomology Preprint Archive (papers from 1995),
managed at Department of Mathematics, University of Georgia, USA.
http://www.math.uga.edu/archive.html.

Field Arithmetic Archive, located at Ben Gurion University in Be'er-Sheva, Israel, stores 
electronic preprints on the arithmetic of fields, Galois theory, model theory of fields, and 
related topics, http://www.cs.bgu.ac.il/research/Fields/.

MGNet preprints (papers from 1991, last paper deposited 2001), Department of
Computer Science, Yale University, a repository for information related to multigrid,
multilevel, multiscale, aggregation, defect correction, and domain decomposition
methods, http://casper.cs.yale.edu/mgnet/www/mgnet-papers.html.

Cogprints (1997-), an electronic archive for self-archived papers in any area of Psychology,
Neuroscience, and Linguistics, and many areas of Computer Science, Philosophy, Biology,
Medicine, Anthropology as well as any other areas pertinent to the study of cognition,
initially a project in the JISC Electronic Libraries (eLib) Programme, administered by the
IAM Group, University of Southampton, http://cogprints.ecs.soton.ac.uk/.

E-LIS, E-Prints in Library and Information Science, http://eprints.rclis.org/.

DList, Digital Library of Information Science and Technology (October 2002-), managed
by School of Information Resources and Library Science and Arizona Health Sciences
Library, University of Arizona, http://dlist.sir.arizona.edu/.

NetPrints Clinical Medicine and Health Research (December 1999-), BMJ Publishing
Group and HighWire Press, a repository of non-peer reviewed original research,
http://clinmed.netprints.org/home.dtl.

Chemistry Preprint Server (August 2000-), ChemWeb.com, Elsevier,
http://www.chemweb.com/preprint.

Computer Science Preprint Server (November 2001-), Elsevier,
http://www.compscipreprints.com/comp/Preprint/show/index.htt.

Mathematics Preprint Server (May 2001-), Elsevier, a registered data provider, supporting
OAIv2.0. http://www.mathpreprints.com/math/Preprint/show/. (842 papers as of
25 March 2003)

HTP Prints, the History & Theory of Psychology Eprint Archive (September 2001-),
administered at York University, Toronto, http://htpprints.yorku.ca/.

Education-line (1997-), a freely accessible database of the full text of conference papers, 
working papers, and electronic literature which supports educational research, policy and 
practice, initially a project in the JISC Electronic Libraries (eLib) Programme, administered 
by the Brotherton Library, University of Leeds, http://www.leeds.ac.uk/educol/.

Social Science Research Network (SSRN), Social Science Electronic Publishing, Inc., 
working papers and abstracts are provided by journals, publishers, and institutions for 
distribution through SSRN's eLibrary, which consists of two parts: a database containing 
abstracts on over 49,200 scholarly working papers and forthcoming papers, and an 
Electronic Paper Collection containing over 30,800 (27 March 2003) downloadable full-text 
documents. SSRN is composed of specialized research networks/journals in the social
sciences: Accounting, Economics, Financial Economics, Legal Scholarship, Management, 
Negotiations. 

Electronic Colloquium on Computational Complexity (papers from 1994), led by the
chair of theoretical computer science and new applications at the University of Trier.
Research reports, surveys, and books in computational complexity,
http://www.eccc.uni-trier.de/eccc/.

Cryptology ePrint Archive (2000-), maintained by the International Association for 
Cryptologic Research (IACR), incorporates contents of the Theory of Cryptology Library 
1996-1999. http://eprint.iacr.org/.

The Digital Library of the Commons (DLC), Indiana University, contains a Working
Paper Archive of author-submitted papers, as well as full-text conference papers,
dissertations, working papers, and pre-prints. (The commons is a general term for shared
resources in which each stakeholder has an equal interest. Studies on the commons
include the information commons with issues about public knowledge, the public domain,
open science, and the free exchange of ideas.) http://dlc.dlib.indiana.edu/.

Organic Eprints (September 2002-), established by the Danish Research Centre for 
Organic Farming (DARCOF), open access archive for papers related to research in organic 
agriculture, http://orgprints.org/.

University of California International and Area Studies (UCIAS) Digital Collection 
(October 2002-), partnership of the University of California Press, the California Digital 
Library (CDL), and internationally oriented research units on eight UC campuses, 
publishes articles, monographs, and edited volumes that are peer-reviewed according 
to standards set by an interdisciplinary UCIAS Editorial Board and approved by the 
University of California Press, http://repositories.cdlib.org/uciaspubs/.

Formations, Faculty of Arts, University of Ulster, hosts eprints in Media Studies and 
participative "eLearning Forums" based on short discussion papers. Initially a project in 
the JISC Electronic Libraries (eLib) Programme, http://formations2.ulst.ac.uk/.

Ecology Preprint Registry (papers from July 2001), hosted at the National Center for 
Ecological Analysis and Synthesis, dissemination of new research results destined for 
publication (i.e., not white papers or gray literature), only preprints with a theoretical 
basis can be submitted, the scope may be expanded to include submissions from the entire 
discipline of ecology, http://www.nceas.ucsb.edu:8504/esa/ppr/ppr.Query.



Note: This is a selection of the types of resources that are available and is offered for 
purposes of illustration. Inclusion in the list does not represent an endorsement, nor should
any meaning be inferred about resources that have not been identified here. 